|Sample 4: Cross Validation for Regression: Auto Imports Dataset|
|Automobile Price Data (Raw) (Visualization) 1|
|Automobile Price Data (Raw) (Visualization) 2|
Now, if a salesperson could accurately predict the price a customer would pay for a vehicle, then he or she could maximize profit by selling at exactly that amount. Conversely, if a customer could accurately predict the price other buyers would pay for the vehicle, then he or she would know whether the current asking price is a good deal. Obviously, this type of information would be very valuable to anyone who could produce it. Let's see if we can do it!
Before we move on, we should note that we have no idea what the "Symboling" and "Normalized Losses" columns mean. In practice, we should never model with variables that we don't understand. So, we went looking and found this snippet in the dataset's documentation:
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars.
The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky; -3 indicates that it is probably pretty safe.

Let's look at the visualization for "Symboling".
The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc.), and represents the average loss per car per year.
|Symboling (Histogram) (5 Bars)|
|Symboling (Histogram) (6 Bars)|
|Select Columns in Dataset|
|Clean Missing Data|
As you can see, replacing "Non-Existent" values is simple. However, what are our options for replacing "Unknown" values? The most common method is to replace the missing values with a "common" value from the same column. The three prominent methods for defining "common" are Mean, Median and Mode. All three of these options are available within Azure ML. We prefer to use Median as it is less susceptible to outliers than Mean and a better indicator of "centrality" than Mode.
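For readers working outside Azure ML, the same replacements are easy to sketch in pandas. The column names and values below are invented for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical slice of the Auto Imports data; values are made up.
df = pd.DataFrame({
    "horsepower": [111, 154, np.nan, 102, 115],
    "num-of-doors": ["two", "four", None, "four", "two"],
})

# Numeric column: fill with the median, which resists outliers better than the mean.
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())

# Categorical column: fill with the mode (most frequent value).
df["num-of-doors"] = df["num-of-doors"].fillna(df["num-of-doors"].mode()[0])
```

Note that `mode()` can return several equally frequent values; taking the first is an arbitrary tie-break.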
On the other hand, what if accuracy is a major concern and missing values are not acceptable? We have two options for that, removing the row or removing the column. If a particular set of rows has missing values across many columns, then it may be a good decision to remove those rows. Conversely, if many rows have missing values across a particular set of columns, then it may be a good decision to remove those columns. The decision of whether to remove rows, columns, or neither is based heavily on subject matter expertise. It is very easy to bias our dataset or lose valuable accuracy by removing too many rows or columns.
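In pandas terms, the two removal options might be sketched like this (toy data; the 50% threshold for dropping a column is our own arbitrary cutoff, not a rule from the sample):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "price": [13495, np.nan, 16500, 13950],
    "stroke": [2.68, 3.47, np.nan, 3.40],
    "mostly-missing": [np.nan, np.nan, 1.0, np.nan],
})

# Option 1: drop rows that are missing a value in a critical column.
by_row = df.dropna(subset=["price"])

# Option 2: drop columns where more than half the values are missing.
by_col = df.loc[:, df.isna().mean() <= 0.5]
```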
Finally, what if the previously described techniques are killing the accuracy of our model? Azure ML has two more advanced options for us, MICE and Probabilistic PCA. Instead of trying to explain these concepts ourselves, we'll pull from the Azure documentation. Here's the description of MICE.
For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as Multivariate Imputation using Chained Equations or Multiple Imputation by Chained Equations.
In a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values. In contrast, in a single imputation method (such as replacing a missing value with a column mean) a single pass is made over the data to determine the fill value.
All imputation methods introduce some error or bias, but multiple imputation better simulates the process generating the data and the probability distribution of the data.
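Azure ML's MICE module isn't available outside the service, but scikit-learn's `IterativeImputer` implements the same chained-equations idea: each column with missing data is modeled on the other columns, cycling until the fills stabilize. A minimal sketch on toy numeric data:

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries (values are made up).
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
])

# Each pass regresses every incomplete column on the others
# and refills the missing cells from the fitted model.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```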
For a general introduction to methods for handling missing values, see Missing Data: the state of the art (Schafer and Graham, 2002).

Here's the description of Probabilistic PCA.
Replaces the missing values by using a linear model that analyzes the correlations between the columns and estimates a low-dimensional approximation of the data, from which the full data is reconstructed. The underlying dimensionality reduction is a probabilistic form of Principal Component Analysis (PCA), and it implements a variant of the model proposed in the Journal of the Royal Statistical Society, Series B 21(3), 611–622 by Tipping and Bishop.
Compared to other options, such as Multiple Imputation using Chained Equations (MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it approximates the covariance for the full dataset. It may therefore offer better performance for datasets that have missing values in many columns.
The key limitations of this method are that it expands categorical columns into numerical indicators and computes a dense covariance matrix of the resulting data. It also is not optimized for sparse representations. For these reasons, datasets with large numbers of columns and/or large categorical domains (tens of thousands) are not supported due to prohibitive space consumption.

Simply put, if you have a large amount of missing data, you should consider using one of these techniques. Probabilistic PCA is better for dense, parametric datasets, while MICE is better for sparse and/or non-parametric datasets.
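Azure ML's Probabilistic PCA module isn't exposed as a library call, but the core idea — fill missing cells from a low-rank reconstruction, then refine — can be sketched by hand with plain SVD. This is a simplified, non-probabilistic approximation of the concept, not the Tipping and Bishop model itself:

```python
import numpy as np

def pca_impute(X, n_components=1, n_iter=100):
    """Simplified PCA-style imputation: alternate between a low-rank
    reconstruction of the data and re-filling the missing cells."""
    X = X.astype(float).copy()
    mask = np.isnan(X)
    # Start by filling each missing cell with its column mean.
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        # Rank-k approximation of the centered data via SVD.
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        approx = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        # Only overwrite the cells that were originally missing.
        X[mask] = approx[mask]
    return X
```

On data that is truly low-rank, the refilled cells converge toward the values consistent with that structure.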
With all of this in mind, let's look back at our data. In order to determine the best imputation, we'll use the "Summarize Data" module.
|Summarize Data (Visualization) 1|
|Summarize Data (Visualization) 2|
The visualization for the Summarize Data module gives us some summary statistics about each column, including how many missing values it has. The columns with missing values are shown in the preceding pictures. The first thing to note is that the "Price" column has 4 missing values. Since we are attempting to predict "Price", it would not be appropriate to impute values into that column. So, let's start by removing those 4 rows and see if we still have more missing values.
|Remove Rows with Missing Price|
|Remove Rows with Missing Price (Visualization) 1|
|Remove Rows with Missing Price (Visualization) 2|
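For readers following along outside Azure ML, the last two steps — counting missing values per column, then dropping the rows with a missing label — could be sketched in pandas. The data below is invented and the column names are assumptions:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "price": [13495, 16500, np.nan, 13950, np.nan],
    "horsepower": [111, 154, 102, np.nan, 115],
})

# Rough equivalent of "Summarize Data": per-column missing counts.
missing_counts = df.isna().sum()

# Drop rows where the label itself is missing; never impute the target.
df = df.dropna(subset=["price"])
```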
|Num of Doors|
As an interesting side note here, when we were using this example in a presentation, a knowledgeable car enthusiast suggested that we use the body type of the car to determine what we should replace these missing values with. This just goes to show that domain expertise can be very helpful when you are building data science solutions.
|Replace Missing Num of Doors with Unknown|
|Replace Missing Num of Doors with Unknown (Visualization)|
|Replace Missing Numeric Values with Median|
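The final two cleaning steps — flagging missing "num-of-doors" values as "Unknown" and filling the remaining numeric gaps with medians — might look like this in pandas (toy data, column names assumed):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "num-of-doors": ["two", None, "four", None],
    "bore": [3.47, np.nan, 2.68, 3.19],
})

# Categorical gap: flag it explicitly rather than guess a value.
df["num-of-doors"] = df["num-of-doors"].fillna("Unknown")

# Remaining numeric gaps: fall back to each column's median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
```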