Monday, June 12, 2017

Azure Machine Learning: Normalizing Features for Regression

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression, Online Gradient Descent Linear Regression, Boosted Decision Tree Regression, Poisson Regression and Model Evaluation phases of the experiment.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.

In the previous post, we calculated key evaluation statistics for our regression models, with R Squared chief among them.  However, we left out a very important concept known as Normalization.

Many statistical algorithms (including some regression algorithms) attempt to determine the "best" model by reducing the variance of something (often the residuals).  However, this can be a problem when we are dealing with features on massively different scales.  Let's start by considering the calculation for variance.  The calculation starts by taking an individual value and subtracting the mean (also known as the average).  This means that for very large values (like "Price" in our dataset), this difference will tend to be very large, while for small values (like "Stroke" and "Bore" in our dataset), this difference will be very small.  Then, we square this difference, which makes it positive and exaggerates the large deviations even further.  Finally, we repeat this process for the rest of the values in the column, add the squared differences together and divide by the number of records.  In symbols, Variance = Sum( (value - mean)^2 ) / (number of records).
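To make this concrete, here's a minimal sketch in Python (the numbers are made up for illustration, but any two features on similarly mismatched scales behave the same way):

import numpy as np

# Hypothetical values on very different scales, mimicking "Price" and "Stroke".
price = np.array([13495.0, 16500.0, 23875.0, 17450.0, 15250.0])
stroke = np.array([2.68, 3.47, 3.40, 2.68, 3.19])

# Variance: subtract the mean, square, sum, divide by the number of records.
def variance(x):
    return np.sum((x - x.mean()) ** 2) / len(x)

print(variance(price))   # on the order of millions
print(variance(stroke))  # well under 1

# The ZScore transformation: subtract the mean, divide by the standard deviation.
def zscore(x):
    return (x - x.mean()) / x.std()

print(variance(zscore(price)))   # 1.0
print(variance(zscore(stroke)))  # 1.0 -- both features now live on the same scale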

So, if we asked an algorithm to minimize this value across a number of different factors, we would find that it almost always focuses on the features with the largest scales, while effectively ignoring the smaller ones.  Therefore, it would be extremely helpful if we could take all of our features and put them on the same scale.  This is what normalization does.  Let's take a look at the module in Azure ML.
Normalize Data
We can see that the "Normalize Data" module takes our data and applies a single "Transformation Method" to the columns of our choice.  In this experiment, we'll stick with the ZScore transformation.  We may dig deeper into the other methods in a later post.  Also, we are choosing to exclude the "Price" column from our normalization.  In most cases, there's not much harm in normalizing the dependent variable.  However, we're withholding it for two reasons.  First, if we were to normalize the "Price" column, we would get normalized predictions out of the model, which means we would have to reverse the transformation to get back to our original scale.  Second, Poisson Regression requires a non-negative whole number as the dependent variable, while normalization produces positive and negative decimal values centered around 0.  Let's take a look at the visualization.
Normalize Data (Visualization)
We can see that these values are no longer large whole numbers like they were before.  Instead, they are small positive and negative decimals.  It's important to note that the Mean and Standard Deviation of these normalized features are very close to 0 and 1, respectively.  This is exactly what the ZScore transformation does.  However, the true purpose of this normalization is to see if it has any impact on our regression models.  Let's take a look.  For all of these evaluations, the unmodified values are used in the left model and normalized values are used in the right model.
Ordinary Least Squares Linear Regression
Online Gradient Descent Linear Regression
Boosted Decision Tree Regression
Poisson Regression
Before we began this experiment, we already knew that Linear and Boosted Decision Tree Regression were robust to normalization (meaning that normalizing the features wouldn't have an impact).  However, the MSDN article for Poisson Regression specifically states that we should normalize our features.  Given the underlying mathematics and the test we just conducted, we're not sure why this would be necessary.  If anyone has any ideas, feel free to leave a comment.  Regardless, the point of this experiment is still valid.  There are some algorithms for which normalizing features ahead of time is necessary.  K-Means Clustering is one such algorithm.

With this in mind, we can conclusively say that Poisson Regression (without normalization) created the best model for our situation.  Hopefully, this experiment has enlightened you to all the ways in which you can use Regression in your organization.  Regression truly is one of the easiest techniques to use in order to gain tremendous value.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, May 15, 2017

Azure Machine Learning: Regression Model Evaluation

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression, Online Gradient Descent Linear Regression, Boosted Decision Tree Regression and Poisson Regression phases of the experiment.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.  While there are more regression techniques available in Azure ML, we're going to talk about the steps for evaluating our regression models.  Let's start by talking about Testing and Training Sets.

So far in our experiment, we've trained our models using the entire data set.  This is great because it gives the algorithm as much information as possible to make the best possible model.  Now, let's think about how we would test the model.  Testing the model, also known as model evaluation, requires that we determine how well the model would predict for data it hasn't seen yet.  In principle, we could easily train the model using the entire dataset, then use that same dataset to test the model.  However, that would only tell us how well the model can predict for data it has already seen, which is not the purpose of testing.

The most common approach to alleviate this issue is to split your data into two different sets, a Training Set and a Testing Set.  The Training Set is used to train the model and the Testing Set is used to test the model.  Using this methodology, we are testing the model with data it hasn't seen yet, which is the entire point.  To do this, we'll use the "Split Data" module.
Split Data
For this experiment, we'll be using "Split Rows" as our "Splitting Mode".  The other choices are more complex.  You can read about them here.

The "Fraction of Rows in the First Output Dataset" defines how many rows will be passed through the left output of the module.  In our case, we'll use .7 (or 70%) of our data for our Training Set and the remaining 30% for our Testing Set.  This is known as a 70/30 split and is generally considered the standard way to split.

The "Randomized Split" option is very important.  If we were to deselect this option, the first segment of our rows (70% in this case) would go through the left output and the last segment (30% in this case) would go through the right side.  Ordinarily, this is not what we would want.  However, there may be some obscure cases where you could apply a special type of sorting beforehand, then use this technique to split the data.  If you know of any other reasons, feel free to leave us a comment.

The "Random Seed" option allows us to split our data the same way every time.  This is really helpful for demonstration and presentation purposes.  Obviously, this parameter only matters if we have "Randomized Split" selected.

Finally, the "Stratified Split" option is actually quite significant.  If we select "False" for this parameter, we will be performing a Simple Random Sample.  This means that Azure ML will randomly choose a certain number of rows (70% in our case) with no regard for what values they have or what order they appeared in the original dataset.  Using this method, it's possible (and actually quite common) for the resulting dataset to be biased in some way.  Since the algorithm doesn't look at the data, it has no idea whether it's sampling too heavily from a particular category or not.
Price (Histogram)
If we look at the "Price" column in our dataset, we see that it is heavily right-skewed.  If we were to perform a Simple Random Sample on this data, it's very possible that we could miss most (if not all) of the vehicles with high prices.  This would bias our sample toward vehicles with lower prices, potentially hurting our predictive ability.
Split Data (Stratified)
One way to alleviate this is to perform a Stratified Sample.  This can be accomplished by selecting "True" for the "Stratified Split" option.  Next, we need to select a column (or set of columns).  Basically, Azure ML will break the data up into segments based on the unique values in the selected column (or combination of columns).  Then, it will perform a Simple Random Sample on each segment.  This can be extremely helpful if your data set has some extremely important factors that need to have similar distributions in the Training Set and the Testing Set.  For instance, if we were trying to classify whether a car would sell (Yes or No), then it would be a very good idea to take a Stratified Sample over the "Sold" column.  However, this technique should not be used when the Stratification column has a large number of unique values.  This is the case with our price column.  Therefore, Stratification is not appropriate here.
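Outside Azure ML, the same idea looks like this with scikit-learn (sketch only; the "sold" column below is invented to match the example above):

import pandas as pd
from sklearn.model_selection import train_test_split

cars = pd.DataFrame({
    "price": [13495, 16500, 23875, 17450, 15250, 30760, 7975, 9495, 20970, 8916],
    "sold":  ["yes", "no", "yes", "yes", "no", "yes", "no", "yes", "no", "no"],
})

# stratify= keeps the yes/no proportions roughly equal in both outputs,
# like setting "Stratified Split" to True over the "Sold" column.
train_set, test_set = train_test_split(
    cars, train_size=0.7, random_state=42, stratify=cars["sold"]
)
print(train_set["sold"].value_counts(normalize=True))
print(test_set["sold"].value_counts(normalize=True))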

As an interesting side note, we experimented with this technique and found that creating Stratified Samples over continuous variables (like price) can have surprising results.  For instance, we built a dataset that contained a single column with the values 1 through 100 and no duplicates.  When we tried to pull a 50% Stratified Sample of this data, we found that Azure ML takes a separate sample for every distinct value.  This means that it will try to take a 50% Simple Random Sample of a single row containing the value 5, for instance.  In every case, taking a sample of a single row will return that row, regardless of the sampling percentage.  Therefore, we ended up with all 100 rows in our left output and 0 rows in our right output, even though we wanted a 50/50 split.
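We can reproduce that behavior with a quick sketch.  We don't know Azure ML's exact rounding rule, so we assume the per-stratum sampler always takes at least one row (a ceiling); under that assumption, every singleton stratum returns its only row:

import math
import pandas as pd

# 100 unique values -- every stratum contains exactly one row.
df = pd.DataFrame({"value": range(1, 101)})

frac = 0.5
sampled = (
    df.groupby("value", group_keys=False)
      .apply(lambda g: g.sample(n=math.ceil(frac * len(g))))  # assumed rounding rule
)
print(len(sampled))  # 100 -- the "50%" sample contains the entire dataset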

Now that we've created our 70/30 split of the data, let's look at how it fits into the experiment.  We'll start by looking at the two different layouts, Train Model (which is used for Ordinary Least Squares Linear Regression) and Tune Model Hyperparameters (which is used for the other 3 regression algorithms).
Train Model

Tune Model Hyperparameters
We can see that the "Train Model" layout is pretty simple, the Training Set goes into the "Train Model" module and the Testing Set goes into the "Score Model" module.  However, in the "Tune Model Hyperparameters" layout, the Training Set and Testing Set both go in the "Tune Model Hyperparameters" module and the Testing Set also goes into the "Score Model" module.  Let's add all of these together and drop some "Evaluate Model" modules onto the end.
Model Evaluation
Unfortunately, we need two "Evaluate Model" modules because each can only accept two inputs.  Let's take a look at the visualization of the "Evaluate Model" module.
Ordinary Least Squares vs. Online Gradient Descent
Boosted Decision Tree vs Poisson


Some of you may recognize that these visualizations look completely different from those created by the "Evaluate Model" module we used in our previous posts about ROC, Precision, Recall and Lift.  This is because the model evaluation metrics for Regression are completely different from those for Classification.  Looking at the Metrics section, we see that the first four metrics are Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error and Relative Squared Error.  All of these measures tell us how far our predictions deviate from the actual values they are supposed to predict.  We generally don't pay much attention to these individually, but lower is always better in practice.

The measure we are concerned with is Coefficient of Determination, also known as R Squared.  We've mentioned this metric a number of times during this blog series, but never really described what it tells us.  Basically, R Squared tells us how "good" our model is at predicting.  Higher values of R Squared are good and smaller values are bad.  It's difficult to determine what counts as an acceptable value.  Some people say .7 and other people say .8.  If we find anything lower than that, we might want to consider a different technique.  Fortunately for us, the Poisson Regression model has an R Squared of .97, which is extremely good.  As a side note, one of the reasons why we generally don't consider the values of the first four metrics is that they rarely disagree with one another.  If R Squared tells you that one model is the best, it's likely that all of the other metrics will tell you the same thing.  R Squared simply has the advantage of a fixed scale (1 is a perfect fit, while values near 0 mean the model is barely better than predicting the mean), which means we can identify not only which model is best, but also whether that model is "good enough".
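For reference, here's a sketch of how all five metrics can be computed by hand (scikit-learn and NumPy for illustration; the actual and predicted prices are made up):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([13495.0, 16500.0, 23875.0, 17450.0, 15250.0])
y_pred = np.array([14100.0, 15900.0, 22500.0, 18200.0, 14800.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# The "relative" errors compare the model against always predicting the mean.
baseline = np.full_like(y_true, y_true.mean())
rae = np.abs(y_true - y_pred).sum() / np.abs(y_true - baseline).sum()
rse = ((y_true - y_pred) ** 2).sum() / ((y_true - baseline) ** 2).sum()

r2 = r2_score(y_true, y_pred)  # Coefficient of Determination = 1 - RSE
print(mae, rmse, rae, rse, r2)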

To finalize this evaluation, it seems that the Poisson Regression algorithm is the best model for this data.  This is especially interesting because, in the previous post, we commented on the fact that the Poisson Distribution is not theoretically appropriate for this data.  Given this, it may be a good idea to find an additional set of validation data to confirm our results.  Alas, that's beyond the scope of this experiment.

Hopefully, this discussion has opened your eyes to the possibilities of using Regression to answer some more complex business problems.  Stay tuned for the next post where we'll be talking about Normalizing Features for Regression.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, May 1, 2017

Azure Machine Learning: Regression Using Poisson Regression

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression, Online Gradient Descent Linear Regression and Boosted Decision Tree Regression phases of the experiment.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.  Specifically, we're going to be walking through the Poisson Regression algorithm.  Let's start by talking about Poisson Regression.

Poisson Regression is used to predict values that have a Poisson Distribution, i.e. counts within a given timeframe.  For example, the number of customers that enter a store on a given day may follow a Poisson Distribution.  Given that these values are counts, there are a few caveats.  First, the counts cannot be negative.  Second, the counts could theoretically extend to infinity.  Finally, the counts must be whole numbers.

Just by looking at these three criteria, it may seem like Poisson Regression is theoretically appropriate for this data set.  However, the issue comes when we consider the mathematical underpinning of the Poisson Distribution.  Basically, the Poisson Distribution assumes that each entity being counted operates independently of the other entities.  Back to our earlier example, we assume that each customer entering the store on a given day does so without considering whether the other customers will be going to the store on that day as well.  Comparing this to our vehicle price data, that would be akin to saying that when a car is bought, each dollar independently decides whether it wants to jump out of the buyer's pocket and into the seller's hand.  Obviously, this is a ludicrous notion.  However, we're not theoretical purists and love bending rules (as long as they produce good results).  For us, the true test comes from the validation portion of the experiment, which we'll cover in a later post.  If you want to learn more about Poisson Regression, read this and this.  Let's take a look at the parameters for this module.
Poisson Regression
The Poisson Regression algorithm uses an optimization technique known as Limited Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS).  Basically, this technique tries to find the "best" set of parameters to fill in our Poisson Regression equation, which is described in detail here.  In practice, the smaller we make "Optimization Tolerance", the longer the algorithm will take to train and the more accurate the results should be.  This value can be optimized using the "Tune Model Hyperparameters" module.

Without going into too much depth, the "L1 Regularization Weight" and "L2 Regularization Weight" parameters penalize complex models.  If you want to learn more about Regularization, read this and this.  As with "Optimization Tolerance", Azure ML will choose this value for us.
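As an aside for readers working outside Azure ML, scikit-learn's PoissonRegressor exposes the same knobs and also trains with L-BFGS.  Here's a minimal sketch on synthetic count data (note that, unlike the Azure ML module, it only supports an L2 penalty):

import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic counts; the real experiment uses the automobile prices instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.poisson(lam=np.exp(0.3 * X[:, 0] + 0.1 * X[:, 1] + 2.0))

model = PoissonRegressor(
    alpha=1.0,    # L2 regularization weight
    tol=1e-7,     # optimization tolerance: smaller = longer, more precise training
    max_iter=100,
)
model.fit(X, y)
print(model.coef_, model.intercept_)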

"Memory Size for L-BFGS" specifies the amount of memory allocated to the L-BFGS algorithm.  We can't find much more information about what effect changing this value will have.  Through some testing, we did find that this value had very little impact on our model, regardless of how large or small we made it (the minimum value we could provide is 1).  However, if our data set had an extremely large number of columns, we may find that this parameter becomes more significant.  Once again, we do not have to choose this value ourselves.

The "Random Number Seed" parameter allows us to create reproducible results for presentation/demonstration purposes.  Oddly enough, we'd expect this value to play a role in the L-BFGS algorithm, but it doesn't seem to.  We were unable to find any impact caused by changing this value.

Finally, we can choose to deselect "Allow Unknown Categorical Levels".  When we train our model, we do so using a specific data set known as the training set.  This allows the model to predict based on values it has seen before.  For instance, our model has seen "Num of Doors" values of "two" and "four".  So, what happens if we try to use the model to predict the price for a vehicle with a "Num of Doors" value of "three" or "five"?  If we leave this option selected, then this new vehicle will have its "Num of Doors" value thrown into an "Unknown" category.  This would mean that if we had a vehicle with three doors and another vehicle with five doors, they would both be thrown into the same "Num of Doors" category.  To see exactly how this works, check out our previous post, Regression Using Linear Regression (Ordinary Least Squares).

The options for the "Create Trainer Mode" parameter are "Single Parameter" and "Parameter Range".  When we choose "Parameter Range", we instead supply a list of values for each parameter and the algorithm will build multiple models based on the lists.  These multiple models must then be whittled down to a single model using the "Tune Model Hyperparameters" module.  This can be really useful if we have a list of candidate models and want to be able to compare them quickly.  However, we don't have a list of candidate models, but that actually makes "Tune Model Hyperparameters" more useful.  We have no idea what the best set of parameters would be for this data.  So, let's use it to choose our parameters for us.
Tune Model Hyperparameters
Tune Model Hyperparameters (Visualization)
We can see that there is very little difference between the top models using Coefficient of Determination, also known as R Squared.  This is a great sign because it means that our model is very robust and we don't have to sweat over choosing the perfect parameters.
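For readers who want the same workflow outside Azure ML, a parameter sweep can be sketched with scikit-learn's GridSearchCV (illustrative only; the module's actual search strategy may differ):

import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.poisson(lam=np.exp(0.3 * X[:, 0] + 2.0))

# Try every combination in the grid, score by R Squared, keep the best model.
search = GridSearchCV(
    PoissonRegressor(max_iter=500),
    param_grid={"alpha": [0.01, 0.1, 1.0], "tol": [1e-4, 1e-7]},
    scoring="r2",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)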

On a side note, there is a display issue causing some values for the "Optimization Tolerance" parameter to display as 0 instead of whatever extremely small value they actually are.  This is disappointing, as it limits our ability to manually type these values into the "Poisson Regression" module.  One of the outputs from the "Tune Model Hyperparameters" module is the Trained Best Model, i.e. whichever model appears at the top of the list based on the metric we chose.  This means that we can use this as an input into other modules like "Score Model".  However, it does mean that we cannot use these parameters in conjunction with the "Cross Validate Model" module, as that requires an Untrained Model as an input.  Fortunately, this is not a huge deal because we saw that "Optimization Tolerance" does not have a very large effect on the resulting model.
All Regression Models Complete
Hopefully we've laid the groundwork for you to understand Poisson Regression and utilize it in your work.  Stay tuned for the next post where we'll be talking about Regression Model Evaluation.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, April 17, 2017

Azure Machine Learning: Regression Using Boosted Decision Tree Regression

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression and Online Gradient Descent Linear Regression phases of the experiment.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.  Specifically, we're going to be walking through the Boosted Decision Tree Regression algorithm.  Let's start by talking about Decision Tree Regression.
Example Decision Tree
For this example, we used our data set to create a single decision tree with 5 leaves (circled in blue).  Basically, a decision tree is built by choosing the "best" split for each node, then repeating this process until the specified number of leaves is reached.  There's a complex process for choosing the "best" threshold that we won't go into detail on.  You can read more about this process here, here and here.

In our example, we start with the entire dataset.  Then, a threshold on a single variable is chosen to define the "best" split.  Next, the data is split across the two branches.  On the false side, the algorithm decides that there's not enough data to branch again or that the branch would not be "good" enough, so it ends in a leaf.  This means that any records with a value of "Engine Size" > 182 will be assigned a "Predicted Price" of $6,888.41.
Example Decision Tree Predictions
We can see that this is the case by looking at the scored output.  Next, the records with "Engine Size" <= 182 will be passed to the next level and the process will be repeated until there are exactly 5 leaves.  Later in this post, we'll see where the 5 came from.
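Here's a minimal sketch of growing a regression tree capped at 5 leaves (scikit-learn on synthetic data, not Azure ML's exact splitting algorithm):

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: price rises with engine size, plus some noise.
rng = np.random.default_rng(0)
engine_size = rng.uniform(60, 330, size=(200, 1))
price = 90 * engine_size[:, 0] + rng.normal(0, 2000, size=200)

# Cap the tree at 5 leaves, like "Maximum Number of Leaves per Tree" = 5.
tree = DecisionTreeRegressor(max_leaf_nodes=5).fit(engine_size, price)
print(export_text(tree, feature_names=["engine_size"]))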

Now, we've talked about "Decision Tree Regression", but what's "Boosted Decision Tree Regression"?  The process of "Boosting" involves creating multiple decision trees, where each decision tree depends on those that were created before it.  For example, assume that we want to build 3 boosted decision trees.  The first decision tree would attempt to predict the price for each record.  The next tree would be calculated using the same algorithm, but instead of predicting price, it would try to predict the difference between the actual price and the price predicted by the previous tree.  This is known as a residual because it is what's "left over" after the tree is built.  For example, if the first tree predicted a price of 10k, but the actual price was 12k, then the second tree would be trying to predict 12k - 10k = 2k instead of the original 12k.  This way, when the algorithm is finished, the predicted price can be calculated simply by running the record through all of the trees and adding all of the predictions.  However, the process for building the individual trees is more complicated than this example would imply.  You can read about it here and here.  Below is a toy sketch of that residual-fitting idea.
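This sketch builds 3 trees by hand, each fitting the residuals of the ones before it (plain scikit-learn trees on synthetic data; the shrinkage factor previews the "Learning Rate" parameter below, and a value of 1.0 would match the 12k - 10k example exactly):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(60, 330, size=(200, 1))
y = 90 * X[:, 0] + rng.normal(0, 2000, size=200)

learning_rate = 0.5
trees, residual = [], y.copy()
for _ in range(3):  # three boosted trees
    t = DecisionTreeRegressor(max_leaf_nodes=5).fit(X, residual)
    trees.append(t)
    # Each subsequent tree fits what the earlier trees got wrong.
    residual = residual - learning_rate * t.predict(X)

# The final prediction is the (scaled) sum of every tree's prediction.
prediction = learning_rate * sum(t.predict(X) for t in trees)

With that intuition in place, let's take a look at the "Boosted Decision Tree Regression" module in Azure ML.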
Boosted Decision Tree Regression
Some of you may notice that these are the same parameters used in the "Two-Class Boosted Decision Tree" module, which we covered in an earlier post.  Because of the similarity, some of the following descriptions have been lifted from that post.

The "Maximum Number of Leaves per Tree" parameter allows us to set the number of times the tree can split.  It's important to note that splits early in the tree are caused by the most significant predictors, while splits later in the tree are less significant.  This means that the more leaves we have (and therefore more splits), the higher our chance of Overfitting is.  We'll talk more about this in a later post.

The "Minimum Number of Samples per Leaf Node" parameters allows us to set the significance level required for a split to occur.  With this value set at 10, the algorithm will only choose to split (this is known as creating a "new rule") if at least 10 rows, or observations, will be affected.  Increasing this value will lead to broad, stable predictions, while decreasing this value will lead to narrow, precise predictions.

The "Learning Rate" parameter allows us to set how much difference we see from tree to tree.  MSDN describes this quite well as "the learning rate determines how fast or slow the learner converges on the optimal solution. If the step size is too big, you might overshoot the optimal solution. If the step size is too small, training takes longer to converge on the best solution."

The "Random Number Seed" parameter allows us to create reproducible results for presentation/demonstration purposes.  Since this algorithm is not random, this parameter has no impact on this module.

Finally, we can choose to deselect "Allow Unknown Categorical Levels".  When we train our model, we do so using a specific data set known as the training set.  This allows the model to predict based on values it has seen before.  For instance, our model has seen "Num of Doors" values of "two" and "four".  So, what happens if we try to use the model to predict the price for a vehicle with a "Num of Doors" value of "three" or "five"?  If we leave this option selected, then this new vehicle will have its "Num of Doors" value thrown into an "Unknown" category.  This would mean that if we had a vehicle with three doors and another vehicle with five doors, they would both be thrown into the same "Num of Doors" category.  To see exactly how this works, check out our previous post, Regression Using Linear Regression (Ordinary Least Squares).

The options for the "Create Trainer Mode" parameter are "Single Parameter" and "Parameter Range".  When we choose "Parameter Range", we instead supply a list of values for each parameter and the algorithm will build multiple models based on the lists.  These multiple models must then be whittled down to a single model using the "Tune Model Hyperparameters" module.  This can be really useful if we have a list of candidate models and want to be able to compare them quickly.  However, we don't have a list of candidate models.  Strangely, that actually makes "Tune Model Hyperparameters" more useful.  We have no idea what the best set of parameters would be for this data.  So, let's use it to choose our parameters for us.
Tune Model Hyperparameters
Tune Model Hyperparameters (Visualization)
We can see that the "Tune Model Hyperparameters" module will test quite a few different models in order to determine the best combination of parameters.  In this case, it found that a set of 44 trees with 4 leaves, a significance level of 2 records per leaf and a learning rate of .135 has the highest Coefficient of Determination, also known as R Squared.
OLS/OGD Linear Regression and Boosted Decision Tree Regression
Now, our experiment contains 3 different regression methods for predicting the price of a vehicle.  Hopefully, this post opened the door for you to utilize Boosted Decision Tree Regression in your own work.  Stay tuned for our next post where we'll talk about Poisson Regression.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, April 3, 2017

Azure Machine Learning: Regression Using Linear Regression (Online Gradient Descent)

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation and Ordinary Least Squares Linear Regression phases of the experiment.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.  Specifically, we're going to be walking through the Online Gradient Descent algorithm for Linear Regression.  Let's take a look at the parameters.
Linear Regression (OGD)
The "Online Gradient Descent" algorithm creates multiple regression models based on the parameters we provide.  The goal of these regression models is to find the model that fits the data "best", i.e. has the smallest "loss" according a particular loss function.  As with all of the iterative procedures, we trade additional training time and the possibility of getting "stuck" in local extrema in exchange for the change of finding a better solution.  You can read more about OGD here.
Gradient
Here's a visual representation of a gradient that we pulled from Wikipedia.  This valley represents the "loss" or "error" of our model.  Lower "loss" means a better model.  The goal of the OGD algorithm is to find the bottom of the valley in the picture.  Basically, a starting point is randomly assigned and the algorithm simply tries to keep going downhill by building new models.  If it keeps going downhill, it will eventually reach the bottom of the valley.  However, things get much more complicated when you have multiple valleys (what if you get to the bottom of one valley, but there's an even deeper valley next to it?).  We'll touch on this in a minute.  Let's walk through the parameters.

"Learning Rate" is also known as "Step Size".  As the algorithm is trying to go downhill, it needs to know how far to move each time.  This is what "Learning Rate" represents.  A smaller step size would mean that we are more likely to find the bottom of the valley, but it also means that if we stuck in a valley that isn't the deepest, we may not be able to get out.  Conversely, a larger step size would mean that we can more easily find the deepest valley, but may not be able to find the bottom of it.  Fortunately, we can let Azure ML choose this value for us based on our data.

"Number of Training Epochs" defines how many times the algorithm will go through this learning process.  Obviously, the larger the number of iterations, the longer the training process will take.  Also, larger values could potentially lead to overfitting.  As with the other parameters, we don't have to choose this ourselves.

Without going into too much depth, the "L2 Regularization Weight" parameter penalizes complex models in favor of simpler ones.  Fortunately, there's a way that Azure ML will choose this value for us.  So, we don't need to worry too much about it.  If you want to learn more about Regularization, read this and this.
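To make these parameters concrete, here's a rough scikit-learn equivalent (SGDRegressor performs stochastic/online gradient descent; mapping its options onto Azure ML's parameters is our own approximation):

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

model = SGDRegressor(
    loss="squared_error",        # least-squares loss
    eta0=0.01,                   # "Learning Rate" (step size)
    max_iter=30,                 # "Number of Training Epochs"
    penalty="l2",
    alpha=0.001,                 # "L2 Regularization Weight"
    learning_rate="invscaling",  # step size shrinks over time ("Decrease Learning Rate");
                                 # "constant" would keep it fixed
    random_state=42,             # "Random Number Seed"
)
model.fit(X, y)
print(model.coef_)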

According to MSDN, "Normalize Features" allows us to "indicate that instances should be normalized".  We're not quite sure what this is supposed to mean.  There is a concept in regression of normalizing, also known as standardizing, the inputs.  However, we did some testing and were not able to find a situation where this feature had any effect on the results.  Please let us know in the comments if you know of one.

"Average Final Hypothesis" is much more complicated.  Here's the description from MSDN:
In regression models, hypothesis testing means using some statistic to evaluate the probability of the null hypothesis, which states that there is no linear correlation between a dependent and independent variable. 
In many regression problems, you must test a hypothesis involving more than one variable. This option, which is selected by default, tests a combination of the parameters where two or more parameters are involved.
This seems to imply that utilizing the "Average Final Hypothesis" option takes into account the interactions between factors, instead of assuming they are independent.  Interestingly, deselecting this option seems to generally produce better models.  However, it also has an unwritten size limit.  If we deselect this option and try to train the model using too many rows or columns, it will throw an error.  Therefore, we can say that deselecting this option is extremely useful in some cases.  We'll have to try it case-by-case to decide when it is appropriate and when it is not.

The "Decrease Learning Rate" option allows Azure ML to decrease the Learning Rate (aka Step Size) as the number of iterations increases.  This allows us to hone in on an even better model by allowing us to find the tip of the valley.  However, reducing the learning rate also increases the chances that we get stuck in a local minima (one of the shallow valleys).  Deselecting this option is susceptible to the same size limitation as "Average Final Hypothesis", but doesn't seem to have the same positive impact.  Unless we can find a good reason, let's leave this option selected for the time being.

Choosing a value for "Random Number Seed" defines our starting point and allows us to create reproducible results in case we need to use them for demonstrations or presentations.  If we don't provide a value for this parameter, one will be randomly generated.

Finally, we can choose to deselect "Allow Unknown Categorical Levels".  When we train our model, we do so using a specific data set known as the training set.  This allows the model to predict based on values it has seen before.  For instance, our model has seen "Num of Doors" values of "two" and "four".  So, what happens if we try to use the model to predict the price for a vehicle with a "Num of Doors" value of "three" or "five"?  If we leave this option selected, then this new vehicle will have its "Num of Doors" value thrown into an "Unknown" category.  This would mean that if we had a vehicle with three doors and another vehicle with five doors, they would both be thrown into the same "Num of Doors" category.  To see an example of this, read our previous post.

Now that we've walked through all of the parameters, we can use the "Tune Model Hyperparameters" module to choose them for us.  However, it will only choose values for "Learning Rate", "Number of Epochs" and "L2 Regularization Weight".  Since we also found that "Average Final Hypothesis" was significant, we should create two separate streams, one with this option selected and one without.  For an explanation of how "Tune Model Hyperparameters" works, read one of our previous posts.
Tune Model Hyperparameters

Tune Model Hyperparameters (Visualization) (Average Final Hypothesis)

Tune Model Hyperparameters (Visualization) (No Average Final Hypothesis)
The first thing we noticed is that when we deselect the "Average Final Hypothesis" option, we lose 6 rows because of the size limit.  Certain combinations of parameters caused the model to fail.  Fortunately, the "Tune Model Hyperparameters" module is smart enough to throw them out.  In fact, not all 14 of the remaining rows are valid.  For instance, the last row has a Coefficient of Determination (R Squared) of -5e51, i.e. -5 followed by 51 zeroes.  While R Squared can legitimately fall below 0 when a model predicts worse than simply guessing the mean, a value this extreme tells us the training diverged rather than produced a usable model.

Also, it's important to note that the best model with "Average Final Hypothesis" has an R Squared of .678 compared to an R Squared of .769 without.  This is what we meant when we said that disabling the option seems to produce better results.  Unfortunately, our lack of familiarity with what the parameter actually does means that we're not sure if this result is valid or not.  The best we can do is assume that it is.  If you have any information on this, please let us know.
OLS and OGD Linear Regression
So far in this experiment, we've covered the basics of Data Import and Cleansing, Ordinary Least Squares Linear Regression and Online Gradient Descent Linear Regression.  However, that's just the tip of the iceberg.  Stay tuned for the next post where we'll be covering Boosted Decision Tree Regression.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, March 20, 2017

Azure Machine Learning: Regression Using Linear Regression (Ordinary Least Squares)

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous post, we walked through the initial data load and imputation phases of the experiment.
Initial Data Load and Imputation
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.  One way to do this is through a technique called Regression.  Basically, regression is a technique for predicting a numeric value (or set of values) based on a series of numeric inputs.

Now, some of you might be asking what happens to the non-numeric text data.  Turns out, they get converted into numeric variables using a technique called Indicator Variables (also known as Dummy Variables).  With this technique, every text field gets broken down into multiple binary (0/1) fields, each representing a single unique value from the original field.  For instance, the Num of Doors field takes the values "two" and "four".  Therefore, the Indicator Variables for this field would be "Num of Doors = two" and "Num of Doors = four".  Each of these fields takes a value of 1 if the original field contains the value in question, and 0 if it doesn't.  To continue our example, a vehicle with "Num of Doors" = "two" would have a value of 1 in the "Num of Doors = two" field and a value of 0 in the "Num of Doors = four" field.
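Here's a quick sketch of the same encoding using pandas (get_dummies isn't what Azure ML uses internally, but the output has the same shape):

import pandas as pd

cars = pd.DataFrame({"num_of_doors": ["two", "four", "two", "four"]})

# Each unique text value becomes its own 0/1 column.
indicators = pd.get_dummies(cars, columns=["num_of_doors"], dtype=int)
print(indicators)
#    num_of_doors_four  num_of_doors_two
# 0                  0                 1
# 1                  1                 0
# 2                  0                 1
# 3                  1                 0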
Indicator Variables Example
Things actually get a little more complicated when you are dealing with Unknown/NULL values.  The specific technique used varies based on the tool, but rarely has any effect.  We'll see how this works for Linear Regression later in this post.  As a side note, not all modules will automatically convert your fields to Indicator Variables.  In these cases, Azure ML has a module called Convert to Indicator Values that will do this for you.  If we need finer control over exactly how it accomplishes this, we could also use a SQL, R or Python script to handle it.  Let's move on to Linear Regression.

Earlier, we mentioned that Regression is a technique for predicting numeric values using other numeric values.  Linear Regression is a subset of Regression that creates a very specific type of model.  Let's say that we are trying to predict a value x by using values y and z.  A linear regression algorithm will create a model that looks like x = a*y + b*z + c, where a, b and c are called "coefficients", also known as "weights".  Now, this relationship is linear from the coefficients' perspective (meaning that there are no exponents, trigonometric functions, etc.).  However, if we were to alter our data set so that z = y^2, then we would end up with the model x = a*y + b*y^2 + c.  This is LINEAR from the coefficients' perspective, but PARABOLIC from the variables' perspective.  This is one of the major reasons why Linear Regression is so popular.  It's very easy to build, train and comprehend, yet virtually limitless in the variety of relationships it can capture.  Let's take a look at the parameters.
Linear Regression (OLS)
We see that there are two options for "Solution Method".  The first, and most common, method is "Ordinary Least Squares" (OLS).  This method is the one most commonly taught because it has almost no parameters to tinker with.  We basically toss our data at it and it runs.  OLS is also very efficient because the entire algorithm is just a short series of linear algebra operations and only runs through the data once.  You can learn more about OLS here.
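To see just how short that series of linear algebra operations is, here's a sketch of OLS as a single least-squares solve in NumPy (synthetic data following the x = a*y + b*z + c model from above):

import numpy as np

# Generate data where the true model is x = 3*y - 2*z + 5, plus a little noise.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
z = rng.normal(size=100)
x = 3.0 * y - 2.0 * z + 5.0 + rng.normal(0, 0.1, size=100)

# Append a column of ones so the fit includes the intercept term c.
A = np.column_stack([y, z, np.ones(100)])

# One pass of linear algebra recovers the coefficients.
coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
print(coeffs)  # approximately [3.0, -2.0, 5.0] -> a, b and c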

The second option for "Solution Method" is "Online Gradient Descent".  This method is substantially more complicated and will be covered in the next post.

Without going into too much depth, the "L2 Regularization Weight" parameter penalizes complex models.  Unfortunately, the "Tune Model Hyperparameters" module will not choose this value for us.  On the other hand, we tried a few values and did not find it to have any significant impact on our model.  If you want to learn more about Regularization, read this and this.

We can also choose "Include Intercept Term".  If we deselect this option, then our model will change from x = a*y + b*z + c to x = a*y + b*z.  This means that when all of our factors are 0, then our prediction would also be zero.  Honestly, we've never found a reason, in school or in practice, why we would ever want to deselect this option.  If you know of any, please let us know in the comments.

Next, we can choose a "Random Number Seed".  Many machine learning algorithms are random by nature.  That means their "starting point" matters, and running the algorithm multiple times will produce different results.  However, the OLS algorithm is not random.  We tested and confirmed that this parameter has no impact on this algorithm.

Finally, we can choose to deselect "Allow Unknown Categorical Levels".  When we train our model, we do so using a specific data set known as the training set.  This allows the model to predict based on values it has seen before.  For instance, our model has seen "Num of Doors" values of "two" and "four".  So, what happens if we try to use the model to predict the price for a vehicle with a "Num of Doors" value of "three" or "five"?  If we leave this option selected, then this new vehicle will have its "Num of Doors" value thrown into an "Unknown" category.  This would mean that if we had a vehicle with three doors and another vehicle with five doors, they would both be thrown into the same "Num of Doors" category.  We'll see exactly how this works when we look at the indicator variables.

Now that we understand the parameters behind the OLS module, let's look at the results of the "Train Model" module.
Train Model
Train Model (Visualization)


We can see that the visualization is made up of two sections, "Settings" and "Feature Weights".  The "Settings" section simply shows us what parameters we set in the module.  The "Feature Weights" section shows us all of the independent variables (everything except what we were trying to predict, which was Price) as well as their "Weight" or "Coefficient".  Positive weights mean that the value has a positive effect on price and negative weights mean that the value has a negative effect on price.  Let's take a closer look at some of the different features.
Features
We can see that there are quite a few different features in this model.  We've pulled out a few and color coded them for clarity.  Let's start with "Bias".  Remember back to our model equation, x = a*y + b*z + c.  The "Bias" value corresponds to c in our equation.  This tells us that if all of our other factors were 0 (which is impossible for some of our factors), the price of our car would be -$6,008.57.  Obviously, this is a silly value.  Bias, also known as the intercept, is not generally a useful value by itself.  However, it does greatly improve the fit of our models and can be utilized by more advanced techniques.

Next, let's take a look at the features in Grey.  These are all numeric features.  We can tell because they don't have any underscores (_) or pound signs (#) in them.  We see that each additional unit of "Width" adds $600.89 to the predicted price.  We can also see that vehicles with larger values of "Bore" and "Stroke" have lower predicted prices.

Let's move on to the features in Blue.  These are the Indicator Variables we've mentioned a couple of times.  In our original data set, we included a feature called "body-style".  This feature had the values "convertible", "hardtop", "hatchback", "sedan" and "wagon".  Therefore, when the Linear Regression module needed to convert these to Indicator Variables, it used an extremely simple method.  It created new fields with the titles of "<field name>_<field value>_<index>".  The <field name> and <field value> are pulled directly from the record, while <index> is created by ordering the values (notice how they are in alphabetical order?) and counting up from 0.

Now, since we didn't deselect the "Allow Unknown Categorical Levels" option, we have an additional feature for each text field.  This field is named "<field name>#unknown_<index>".  This is the additional category that any new values from the testing set would be thrown into.  Currently, we're not quite sure how it assigns a weight to a value it hasn't seen.  If you know, please let us know in the comments.  It's also interesting to note that the index for the unknown category is not calculated correctly.  It appears to be calculated as [Number of Values] + 1.  However, since indexes start counting at 0 instead of 1, our index is always one larger than it should be.  For instance, the indexes for the "num-of-doors" fields are 0, 1, 2 and 4.

Finally, let's take a look at the "num-of-doors" fields in Purple.  In the previous post, we had some missing values in the "num-of-doors" field.  These values were replaced with a value of "Unknown".  Since "Unknown" is a valid value in our data set, we end up with two different unknown fields in our final result, "num-of-doors_unknown_2" (defined by us) and "num-of-doors#unknown_4" (defined by the algorithm).  This isn't significant; it's just interesting.

As a final note, if we were to perform Linear Regression in other tools, we would be able to access a summary table telling us whether each individual variable was "statistically significant".  For instance, here's a sample R output we pulled from Google.


Call:
lm(formula = a1 ~ ., data = clean.algae[, 1:12])

Residuals:
  Min      1Q  Median      3Q     Max 
  -37.679 -11.893  -2.567   7.410  62.190 

  Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
  (Intercept)  42.942055  24.010879   1.788  0.07537 . 
  seasonspring  3.726978   4.137741   0.901  0.36892   
  seasonsummer  0.747597   4.020711   0.186  0.85270   
  seasonwinter  3.692955   3.865391   0.955  0.34065   
  sizemedium    3.263728   3.802051   0.858  0.39179   
  sizesmall     9.682140   4.179971   2.316  0.02166 * 
  speedlow      3.922084   4.706315   0.833  0.40573   
  speedmedium   0.246764   3.241874   0.076  0.93941   
  mxPH         -3.589118   2.703528  -1.328  0.18598   
  mnO2          1.052636   0.705018   1.493  0.13715   
  Cl           -0.040172   0.033661  -1.193  0.23426   
  NO3          -1.511235   0.551339  -2.741  0.00674 **
  NH4           0.001634   0.001003   1.628  0.10516   
  oPO4         -0.005435   0.039884  -0.136  0.89177   
  PO4          -0.052241   0.030755  -1.699  0.09109 . 
  Chla         -0.088022   0.079998  -1.100  0.27265   
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  Residual standard error: 17.65 on 182 degrees of freedom
  Multiple R-squared:  0.3731,    Adjusted R-squared:  0.3215  
  F-statistic: 7.223 on 15 and 182 DF, p-value: 2.444e-12

Using this table, we can find out which variables are useful and which are not.  Unfortunately, we were not able to find a way to create this table using any of the built-in modules.  We could certainly use an R or Python script to do it, but that's beyond the scope of this post.  Once again, if you have any insight, please share it with us.
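For what it's worth, Python's statsmodels package can produce a comparable table; here's a sketch on synthetic data (wiring it into an Execute Python Script module is left as an exercise):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)  # only the first feature matters

# add_constant appends the intercept; summary() reports coefficients,
# standard errors, t values and p values like the R output above.
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.summary())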

Hopefully, this post enlightened you to the possibilities of OLS Linear Regression.  It truly is one of the easiest, yet most powerful techniques in all of Data Science.  It's made even easier by its use in Azure Machine Learning Studio.  Stay tuned for the next post where we'll dig into the other type of Linear Regression, Online Gradient Descent.  Thanks for reading.  We hope you found this informative.


Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com