Monday, September 4, 2017

Azure Machine Learning in Practice: Model Selection

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation and Data Cleansing phases of the experiment.  In this post, we're going to walk through the model selection process.

In traditional machine learning and data science applications, model selection is a time-consuming process that generally requires a significant amount of statistical background.  Azure Machine Learning completely breaks this paradigm.  As you will see in the next few posts, model selection in Azure Machine Learning requires nothing more than a basic understanding of the problem we are trying to solve and a willingness to let the data pick our model for us.  Let's take a look at our experiment so far.
Experiment So Far
We can see that we've already imported our data and decided to use two different imputation methods, MICE and Probabilistic PCA.  Now, we need to select which models we would like to use to solve our problem.  It's important to remember that our goal is to predict when a transaction is fraudulent, i.e. has a "Class" value of 1.  Before we do that, we should remember to remove the "Row Number" feature from our dataset, as it has no analytical value.
Select Columns in Dataset
Now, let's take a look at our model options.
Initialize Model
Using the toolbox on the left side of the Azure Machine Learning Studio, we can work our way down to the "Initialize Model" section.  Here, we have four different types of models, "Anomaly Detection", "Classification", "Clustering" and "Regression".

"Anomaly Detection" is the area of Machine Learning where we try to find things that look "abnormal".  This is an especially difficult task because it requires defining what's "normal".  Fortunately, Azure ML has some great tools that handle the hard work for us.  These types of models are very useful for Fraud Detection in areas like Credit Card and Online Retail transactions, as well Fault Detection in Manufacturing.  However, our training data already has fraudulent transactions labelled.  Therefore, Anomaly Detection may not be what we're looking for.  However, one of the great things about Data Science is that there are no right answers.  Feel free to add some Anomaly Detection algorithms to the mix if you would like.

"Classification" is the area of Machine Learning where we try to determine which class a record belongs to.  For instance, we can look at information about a person and attempt to determine where they are likely to buy a particular product.  This technique requires that we have an initial set of data where already know the classes.  This is the most commonly used type of algorithm and can be found in almost every subject area.  It's not coincidence that our variable of interest in this experiment is called "Class".  Since we already know whether each of these transactions was fraudulent or not, this is a prime candidate for a "Classification" algorithm.

"Clustering" is the area of Machine Learning where we try to group records together to identify which records are "similar".  This is a unique technique belonging to a category of algorithms known as "Unsupervised Learning" techniques.  They are unsupervised in the sense that we are not telling them what to look for.  Instead, we're simply unleashing the algorithm on a data set to see what patterns it can find.  This is extremely useful in Marketing where being able to identify "similar" people is important.  However, it's not very useful for our situation.

"Regression" is the area of Machine Learning where try to predict a numeric value by using other attributes related to it.  For instance, we can use "Regression" techniques to use information about a person to predict their salary.  "Regression" has quite a bit in common with "Classification".  In fact, there are quite a few algorithms that have variants for both "Classification" and "Regression".  However, our experiment only wants to predict a binary (1/0) variable.  Therefore, it would be inappropriate to use a "Regression" algorithm.

Now that we've decided "Classification" is the category we are looking for, let's see what algorithms are underneath it.
Classification
For the most part, we can see that there are two types of algorithms, "Two-Class" and "Multiclass".  Since the variable we are trying to predict ("Class") only has two values, we should use the "Two-Class" algorithms.  But which one?  This is the point where Azure Machine Learning really stands out from the pack.  Instead of choosing one, or even a few, algorithms, we can try them all.  In total, there are 9 different "Two-Class Classification" algorithms.  However, in the next post, we'll be looking at the "Tune Model Hyperparameters" module.  Using this module, we'll find out that there are actually 14 distinct algorithms, as some of the algorithms have a few different variations and one of the algorithms doesn't work with "Tune Model Hyperparameters".  Here's the complete view of all the algorithms.
Two-Class Classification Algorithms
For those that may have issues seeing the image, here's a list.

Two-Class Averaged Perceptron
Two-Class Boosted Decision Tree
Two-Class Decision Forest - Resampling: Replicate
Two-Class Decision Forest - Resampling: Bagging
Two-Class Decision Jungle - Resampling: Replicate
Two-Class Decision Jungle - Resampling: Bagging
Two-Class Locally-Deep Support Vector Machine - Normalizer: Binning
Two-Class Locally-Deep Support Vector Machine - Normalizer: Gaussian
Two-Class Locally-Deep Support Vector Machine - Normalizer: Min-Max
Two-Class Logistic Regression
Two-Class Neural Network - Normalizer: Binning
Two-Class Neural Network - Normalizer: Gaussian
Two-Class Neural Network - Normalizer: Min-Max
Two-Class Support Vector Machine

A keen observer may notice that the "Two-Class Bayes Point Machine" model was not included in this list.  For some reason, this model cannot be used in conjunction with "Tune Model Hyperparameters".  However, we will handle this in a later post.
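To make the "try them all" approach concrete outside of Azure ML Studio, here's a minimal scikit-learn sketch that fits several rough analogues of these algorithms and compares them on a holdout set.  The file name is hypothetical, the "Row Number" and "Class" column names follow this data set, and the classifiers are stand-ins rather than the exact implementations Azure ML ships.

```python
# A minimal sketch of the "try them all" idea using scikit-learn analogues.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("creditcard_train.csv")          # hypothetical file name
X = df.drop(columns=["Row Number", "Class"])      # column names from this data set
y = df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Boosted Decision Tree": GradientBoostingClassifier(),
    "Decision Forest": RandomForestClassifier(),
    "Support Vector Machine": SVC(probability=True),
    "Neural Network": MLPClassifier(max_iter=500),
}

for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```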

Hopefully, this post helped shed some light on "WHY" you would choose certain models over others.  We can't stress enough that the path to success is to let the data decide which model is best, not "rules-of-thumb" or theoretical guidelines.  Stay tuned for the next post, where we'll use large-scale model evaluation to pick the best possible model for our problem.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, August 14, 2017

Azure Machine Learning in Practice: Data Cleansing

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous post, it's highly recommended that you do, as it provides valuable context.  In this post, we're going to walk through the data cleansing process.

Data Cleansing is arguably one of the most important phases in the Machine Learning process.  There's an old programming adage "Garbage In, Garbage Out".  This applies to Machine Learning even more so.  The purpose of data cleansing is to ensure that the data we are using is "suitable" for the analysis we are doing.  "Suitable" is an amorphous term that takes on drastically different meanings based on the situation.  In our case, we are trying to accurately identify when a particular credit card purchase is fraudulent.  So, let's start by looking at our data again.
Credit Card Fraud Data 1

Credit Card Fraud Data 2
We can see that our data set is made up of a "Row Number" column, 30 numeric columns and a "Class" column.  For more information about what these columns mean and how they were created, read our previous post.  In our experiment, we want to create a model to predict when a particular transaction is fraudulent.  This is the same as predicting when the "Class" column equals 1.  Let's take a look at the "Class" column.
Class Statistics

Class Histogram
Looking at the histogram, we can see that we have heavily skewed data.  Since "Class" is a binary (0/1) variable, its mean is simply the proportion of "1" values, so multiplying the mean by 100 gives us the percentage of fraudulent records.  Therefore, we can see that 0.13% of our records are fraudulent.  This is what's known as an "imbalanced class".  An imbalanced class problem is especially tricky because we have to use a new set of evaluation metrics.  For instance, if we were to always guess that every record is not fraudulent, we would be correct 99.87% of the time.  While these seem like amazing odds, they are completely worthless for our analysis.  If you want to learn more, a quick Google search brought up this interesting article that may be worth a read.  We'll touch on this more in a later post.  For now, let's keep this in the back of our mind and move on to summarizing our data.
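As a quick sanity check of the imbalance described above, here's a minimal sketch (assuming a hypothetical file name and the "Class" column) that computes the fraud rate and the accuracy of the naive "never fraud" baseline.

```python
# For a binary 0/1 column, the mean is the fraction of 1s, and "always
# predict 0" sets the accuracy baseline.
import pandas as pd

df = pd.read_csv("creditcard_train.csv")     # hypothetical file name
fraud_rate = df["Class"].mean()              # fraction of fraudulent rows
print(f"Fraudulent: {fraud_rate:.2%}")
print(f"Accuracy of always guessing 'not fraud': {1 - fraud_rate:.2%}")
```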
Credit Card Fraud Summary 1

Credit Card Fraud Summary 2

Credit Card Fraud Summary 3
A few things stick out when we look at this.  First, all of the features except "Class" have missing values.  We need to take care of this.  Second, the "Class" feature doesn't have any missing values.  This is great!  Given that our goal is to predict fraud, it would be pretty pointless if some of our records didn't have a known value for "Class".  Finally, it's important to note that all of our variables are numeric.  Most machine learning algorithms cannot accept string values as input.  However, most of the Azure Machine Learning algorithms will transform any string features into numeric features.  You can find out more about Indicator Variables in an earlier post.  Now, let's look at some of the ways to deal with our missing values.  Cue the "Clean Missing Data" module.
Clean Missing Data
The task of cleaning missing data is known as Imputation.  Given its importance, we've touched on it a couple of times on this blog (here and here).  The goal of imputation is to create a data set that gives us the "most accurate" answer possible.  That's a very vague concept.  However, we have a big advantage in that we have a data set with known "Class" values to test against.  Therefore, we can try a few different options to see which ones work best with our data and our models.

In the previous posts, we've focused on "Custom Substitution Value" just to save time.  However, our goal in this experiment is to create the most accurate model possible.  Given that goal, it would seem like a waste not to use some of the more powerful tools in our toolbox.  We could use some of the simpler algorithms like Mean, Median or Mode.  However, we have a large number of dense features (this is a result of the Principal Component Analysis we talked about in the previous post).  This means that we have a perfect use case for the heavy-hitters in the toolbox, MICE and Probabilistic PCA (PPCA).  Whereas the Mean, Median and Mode algorithms determine a replacement value by utilizing a single column, the MICE and PPCA algorithms utilize the entire dataset.  This makes them extremely powerful at providing very accurate replacements for missing values.
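For readers working outside of Azure ML, scikit-learn's IterativeImputer behaves in a MICE-like fashion, modelling each feature with missing values as a function of the other features.  The sketch below is an analogy rather than the Azure ML implementation, and the file and column names are assumptions.

```python
# A rough stand-in for the MICE branch of the experiment.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("creditcard_train.csv")                  # hypothetical file name
features = df.drop(columns=["Row Number", "Class"])
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
print(imputed.isna().sum().sum())                         # 0 missing values remain
```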

So, which should we choose?  This is one of the many crossroads we will run across in this experiment; and the answer is always the same.  Let the data decide!  There's nothing stopping us from creating two streams in our experiment, one which uses MICE and one which uses PPCA.  If we were so inclined, we could create additional streams for the other substitution algorithms or a stream for no substitution at all.  Alas, that would greatly increase the development effort, without likely paying off in the end.  For now, we'll stick with MICE and PPCA.  Which one's better?  We won't know that until later in the experiment.

Hopefully, this post enlightened you to some of the ways that you can use Imputation and Data Cleansing to provide additional power to your models.  There was far more we could do here.  In fact, many data scientists approaching hard problems will spend most of their time adding new variables and transforming existing ones to create even more powerful models.  In our case, we don't like putting in the extra work until we know it's necessary.  Stay tuned for the next post where we'll talk about model selection.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, July 24, 2017

Azure Machine Learning in Practice: Fraud Detection

So far, we've been walking through the different algorithms and tools for solving different problems.  However, we've never delved into how a data scientist would solve a real-world problem.  This next series is going to focus on a real data set from www.kaggle.com.  For those that are unfamiliar with Kaggle, it's a website that hosts data science competitions that allow users from all over the world to use whatever tools and algorithms they would like in order to solve a problem.  This data set focuses on credit card fraud.  Specifically, the goal is to use a large set of anonymized data to create a fraud detection algorithm.  You can find out more about the data set here.

Some of you may be thinking "I thought this was going to be a real problem, not a fake one!"  Turns out, we solved this Kaggle problem in almost exactly the same way that we've solved real customer problems at work.  The only difference here is that this data has been anonymized in order to protect everyone's privacy.

For this post, let's take a look at the data set.
Credit Card Fraud Data 1

Credit Card Fraud Data 2
We can see that this data set has the following features: "Row Number", "Time", "V1"-"V28", "Amount" and "Class".  The "Row Number" feature is simply used as a row identifier and should not be included in any of the models or analysis.  The "Time" column represents the number of seconds between the current transaction and the first transaction in the dataset.  This information could be very useful because transactions that occur very rapidly or at constant increments could be an indicator of fraud.  The "Amount" feature is the value of the transaction.  The "Class" feature is our fraud indicator.  If a transaction was fraudulent, this feature would have a value of 1.

Finally, let's talk about the "V1"-"V28" columns.  These columns represent all of the other data we have about these customers and transactions combined into 28 numeric features.  Obviously, there were far more than 28 original features.  However, in order to anonymize the data and reduce the number of features, the creator of the data set used a technique known as Principal Component Analysis (PCA).  This is a well-known mathematical technique for creating a small number of very dense columns out of a large number of sparse columns.  Fortunately for the creators of this data set, it also has the advantage of anonymizing any data you use it on.  While we won't dig into PCA in this post, there is an Azure Machine Learning module called Principal Component Analysis that will perform this technique for you.  We may cover this module in a later post.  Until then, you can read more about it here.
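To give a feel for what the data set's creators likely did, here's a minimal PCA sketch on synthetic data; it compresses 100 stand-in columns into 28 dense components, much like the "V1"-"V28" features.  The numbers are illustrative only.

```python
# Principal Component Analysis compresses many columns into a few dense ones.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 100))              # stand-in for many raw features
scaled = StandardScaler().fit_transform(raw)    # PCA is sensitive to scale
pca = PCA(n_components=28)
components = pca.fit_transform(scaled)          # dense "V1"-"V28"-style columns
print(components.shape)                         # (1000, 28)
```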

Summarize Data
Another interesting aspect to note is that this data set contains around 200,000 rows and has a significant number of missing values.  This was not a part of the original data set provided by Kaggle.  We use this data set as an example for some of our training sessions (training people, not training models).  Therefore, we wanted to add some additional speed bumps to the data in order to enhance the value of the training.  So, instead of using the single large data sets provided by Kaggle, we provide a training set, which has missing values, and a testing set, which does not.  If you would like to use these datasets instead, you can find them here.

Hopefully we've piqued your interest about Fraud Detection in Azure Machine Learning.  Feel free to hop right into the analysis and see what you can do on your own.  Maybe you'll create a better model than us!  Stay tuned for the next post where we'll be talking about cleaning up this data and preparing it for modelling.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, July 3, 2017

Azure Machine Learning: Cross-Validation for Regression

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression, Online Gradient Descent Linear Regression, Boosted Decision Tree Regression, Poisson Regression and Model Evaluation phases of the experiment.  We also discussed Normalization.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.

To finish this experiment, let's take a look at Cross-Validation.  We briefly touched on this topic in a previous post as it relates to Classification models.  Let's dig a little deeper into it.

As we've mentioned before, it is extremely important to have a separation between Training data (data used to train the model) and Testing data (data used to evaluate the model).  This separation is necessary because it allows us to determine how well our model could predict "new" data.  In this case, "new" data is data that the model has not seen before.  If we were to train the model using a data set, then evaluate the model using the same data set, we would have no way to determine whether the model is good at fitting "new" data or only good at fitting data it has already seen.

In practice, these data sets are often created by cleaning and preparing a single data set that contains all of the variables needed for modelling, as well as a column or columns containing the results we are trying to predict.  Then, this data is split into two data sets.  This process is generally random, with a larger portion of the data going to the training set than to the testing set.  However, this methodology has a major flaw.  How do we know if we got a bad sample?  What if by random chance, our training data was missing a significant pattern that existed in our testing set, or vice-versa?  This could cause us to inappropriately identify our model as "good" due to something that is entirely outside of our control.  This is where Cross-Validation comes into play.

Cross-Validation is a process for creating multiple sets of testing and training sets, using the same set of data.  Imagine that we split our data in half.  For the first model, we train the model using the first half of the data, then we test the model using the second half of the data.  Turns out that we can repeat this process by swapping the testing and training sets.  So, we can train the second model using the second half of the data, then test the model using the first half of the data.  Now, we have two models trained with different training sets and tested with different testing sets.  Since the two testing sets did not contain any of the same elements, we can combine the scored results together to create a master set of scored data.  This master set of scored data will have a score for every record in our original data set, without ever having a single model score a record that it was trained with.  This greatly reduces the chances of getting a bad sample because we are effectively scoring every element in our data, not just a small portion.

Let's expand this method a little.  First, we need to break our data into three sets.  The first model is trained using sets 1 and 2, and tested using set 3.  The second model is trained using sets 1 and 3, and tested using set 2.  The third model is trained using sets 2 and 3, and tested using set 1.  In practice, the sets are known as "folds". As you can see, we can extend this method out as far as we would like to create a master set of predictions.

K Fold Cross-Validation
There's even a subset of Cross-Validation known as "Leave One Out" where you train the model using all but one record, then score that individual record using the trained model.  This process is obviously repeated for every record in the data set.
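Outside of Azure ML, the same ideas are easy to sketch with scikit-learn: per-fold scores, a "master set" of out-of-fold predictions, and Leave One Out as the extreme case.  The data below is synthetic and the model choice is arbitrary.

```python
# Every record gets scored by a model that never saw it during training.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score, cross_val_predict

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
model = LinearRegression()

# Ten folds, like Azure ML's default: one R Squared per fold.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(model, X, y, cv=folds, scoring="r2")
print(fold_scores.round(3), fold_scores.mean().round(3))

# A "master set" of out-of-fold predictions, one per original record.
oof_predictions = cross_val_predict(model, X, y, cv=10)
print(oof_predictions.shape)                      # (200,)

# Leave One Out is k-fold with k equal to the number of records.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
print(len(loo_scores))                            # 200
```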

Now, the question becomes "How many folds should I have? 5? 10? 20? 1000?"  This is a major question that some data scientists spend quite a bit of time working with.  Fortunately for us, Azure ML automatically uses 10 folds, so we don't have to worry too much about this question.  Let's see it in action.

If we look back at our previous post, we can see that the normalization did not improve any of our models.  So, for simplicity, let's remove normalization.
Experiment So Far (No Normalization)
Now, let's take a look at the Cross Validate Model module.
Cross Validate Model
We see that this module takes an untrained model and a data set.  It also requires us to choose which column we would like to predict, "Price" in our case.  This module outputs a set of scored data, which looks identical to the scored data we get from the Score Data module, although it contains all of the records, not just those in the testing set.  It also outputs a set of Evaluation Results by Fold.
Evaluation Results by Fold
This output shows a record for every fold in our Cross-Validation.  Since there were ten folds, we have ten rows, plus additional rows for the mean (average) and standard deviation of each column.  The columns in these results show some basic summary statistics for each fold.  This data leans more heavily towards experienced data scientists.  However, it can be easily used to recognize if a particular fold is strikingly different from the others.  This could be helpful for determining whether there are subsets within our data that may contain completely different patterns than the rest of the data.

Next, we have to consider which model to input into the Cross-Validation.  The issue here is that the Cross Validate Model module requires an untrained model, while the Tune Model Hyperparameters module outputs a trained model.  So, in order to use our tuned models from the previous posts, we'll need to manually copy the tuned parameters from the Tune Model Hyperparameters module into the untrained model modules.
Online Gradient Descent Linear Regression Tuned Parameters
Online Gradient Descent Linear Regression
Next, we need to determine which models we want to compare.  While it may be mildly interesting to compare the linear regression model trained with the 70/30 split to the linear regression model trained with Cross-Validation, that would be more of an example of the effect that Cross-Validation can have on the results.  This doesn't have much business value.  Instead, let's compare the results of the Cross-Validation models to see which model is best.  Remember, the entire reason that we used Cross-Validation was to minimize the impact that a bad sample could have on our models.  In the previous post, we found that the Poisson Regression model was the best fit.
Ordinary Least Squares Linear Regression vs Online Gradient Descent Linear Regression

Boosted Decision Tree Regression vs. Poisson Regression
We can see that Poisson Regression still has the highest Coefficient of Determination, meaning that we can once again determine that it is the best model.  However, this leads us to another important question.  Why is it important to perform Cross-Validation if it didn't change the result?  The primary answer to this involves "trust".  Most of what we do in the data science world involves math that most business people would see as voodoo.  Therefore, one of the easiest paths to success is to gain trust, which can be built via a preponderance of evidence.  The more evidence we can provide showing that our algorithm is reliable and accurate, the easier it will be for us to convince other people to use it.

Hopefully, this series opened your mind as to the possibilities of Regression in Azure Machine Learning.  Stay tuned for more posts.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, June 12, 2017

Azure Machine Learning: Normalizing Features for Regression

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression, Online Gradient Descent Linear Regression, Boosted Decision Tree Regression, Poisson Regression and Model Evaluation phases of the experiment.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.

In the previous post, we were able to calculate important evaluation statistics for our regression models, R Squared being the most important.  However, we left out a very important concept known as Normalization.

Many statistical algorithms (including some regression algorithms) attempt to determine the "best" model by reducing the variance of something (often the residuals).  However, this can be a problem when we are dealing with features on massively different scales.  Let's start by considering the calculation for variance.  The calculation starts by taking an individual value and subtracting the mean (also known as average).  This means that for very large values (like "Price" in our dataset), this difference will be very large, while for small values (like "Stroke" and "Bore" in our dataset), this difference will be very small.  Then, we square this value, making the difference even larger (and always positive).  Finally, we repeat this process for the rest of the values in the column, then add them together and divide by the number of records.
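Written out as a formula, the calculation just described is the (population) variance:

$$\sigma^2 \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

where $\bar{x}$ is the mean of the column and $n$ is the number of records.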

So, if we asked an algorithm to minimize this value across a number of different factors, we would find that it would almost always minimize the variance for the largest features, while completely ignoring the small features.  Therefore, it would be extremely helpful if we could take all of our features, and put them on the same scale.  This is what normalization does.  Let's take a look at the module in Azure ML.
Normalize Data
We can see that the "Normalize Data" module takes our data and applies a single "Transformation Method" to the columns of our choice.  In this experiment, we'll stick with using the ZScore transformation.  We may dig deeper into the other methods in a later post.  Also, we are choosing to exclude the "Price" column from our normalization.  In most cases, there's not much harm in normalizing the dependent variable.  However, we're withholding it for two reasons.  First, if we were to normalize the "Price" column, then we would get normalized predictions out of the model.  This would mean that we would have to reverse the transformation in order to get back to our original scale.  Second, Poisson Regression requires a positive whole number as the dependent variable, which is not the case with normalization, which can and will produce positive and negative values centered around 0.  Let's take a look at the visualization.
Normalize Data (Visualization)
We can see that these values are no longer large whole numbers like they were before.  Instead, they are small positive and negative decimals.  It's important to note that the Mean and Standard Deviation of these normalized features are very close to 0 and 1, respectively.  This is exactly what the ZScore transformation does.  However, the true purpose of this normalization is to see if it has any impact on our regression models.  Let's take a look.  For all of these evaluations, the unmodified values are used in the left model and normalized values are used in the right model.
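For reference, here's a minimal sketch of the same ZScore idea using scikit-learn's StandardScaler on two synthetic features with very different scales; after the transformation, both end up with mean roughly 0 and standard deviation roughly 1.  The numbers are made up for illustration.

```python
# ZScore: subtract each column's mean and divide by its standard deviation.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
price = rng.normal(13000, 8000, size=(500, 1))    # large-scale feature
bore = rng.normal(3.3, 0.3, size=(500, 1))        # small-scale feature
X = np.hstack([price, bore])

z = StandardScaler().fit_transform(X)             # equivalent to (x - mean) / std
print(z.mean(axis=0).round(3))                    # approximately [0, 0]
print(z.std(axis=0).round(3))                     # approximately [1, 1]
```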
Ordinary Least Squares Linear Regression
Online Gradient Descent Linear Regression
Boosted Decision Tree Regression
Poisson Regression
Before we began this experiment, we already knew that Linear and Boosted Decision Tree Regression were robust against normalization (meaning that normalizing the features wouldn't have an impact).  However, the MSDN article for Poisson Regression specifically states that we should normalize our features.  Given the underlying mathematics and the test we just conducted, we're not sure why this would be necessary.  If anyone has any ideas, feel free to leave a comment.  Alas, the point of this experiment is still valid.  There are some algorithms where Normalizing features ahead of time is necessary.  K-Means Clustering is one such algorithm.

With this in mind, we can conclusively say that Poisson Regression (without normalization) created the best model for our situation.  Hopefully, this experiment has enlightened you to all the ways in which you can use Regression in your organization.  Regression truly is one of the easiest techniques to use in order to gain tremendous value.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, May 15, 2017

Azure Machine Learning: Regression Model Evaluation

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression, Online Gradient Descent Linear Regression, Boosted Decision Tree Regression and Poisson Regression phases of the experiment.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.  While there are more regression techniques available in Azure ML, we're going to talk about the steps for evaluating our regression models.  Let's start by talking about Testing and Training Sets.

So far in our experiment, we've trained our models using the entire data set.  This is great because it gives the algorithm as much information as possible to make the best possible model.  Now, let's think about how we would test the model.  Testing the model, also known as model evaluation, requires that we determine how well the model would predict for data it hasn't seen yet.  In practice, we could easily train the model using the entire dataset, then use that same dataset to test the model.  However, that would determine how well the model can predict for data it has already seen, which is not the purpose of testing.

The most common approach to alleviate this issue is to split your data into two different sets, a Training Set and a Testing Set.  The Training Set is used to train the model and the Testing Set is used to test the model.  Using this methodology, we are testing the model with data it hasn't seen yet, which is the entire point.  To do this, we'll use the "Split Data" module.
Split Data
For this experiment, we'll be using "Split Rows" as our "Splitting Mode".  The other choices are more complex.  You can read about them here.

The "Fraction of Rows in the First Output Dataset" defines how many rows will be passed through the left output of the module.  In our case, we'll use .7 (or 70%) of our data for our Training Set and the remaining 30% for our Testing Set.  This is known as a 70/30 split and is generally considered the standard way to split.

The "Randomized Split" option is very important.  If we were to deselect this option, the first segment of our rows (70% in this case) would go through the left output and the last segment (30% in this case) would go through the right side.  Ordinarily, this is not what we would want.  However, there may be some obscure cases where you could apply a special type of sorting beforehand, then use this technique to split the data.  If you know of any other reasons, feel free to leave us a comment.

The "Random Seed" option allows us to split our data the same way every time.  This is really helpful for demonstration and presentation purposes.  Obviously, this parameter only matters if we have "Randomized Split" selected.

Finally, the "Stratified Split" option is actually quite significant.  If we select "False" for this parameter, we will be performing a Simple Random Sample.  This means that Azure ML will randomly choose a certain number of rows (70% in our case) with no regard for what values they have or what order they appeared in the original dataset.  Using this method, it's possible (and actually quite common) for the resulting dataset to be biased in some way.  Since the algorithm doesn't look at the data, it has no idea whether it's sampling too heavily from a particular category or not.
Price (Histogram)
If we look at the "Price" column in our dataset, we see that it is heavily right-skewed.  If we were to perform a Simple Random Sample on this data, it very possible that we could miss most (if not all) of the vehicles with high prices.  This would bias our sample toward vehicles with lower prices, potentially hurting our predictive ability.
Split Data (Stratified)
One way to alleviate this is to perform a Stratified Sample.  This can be accomplished by selecting "True" for the "Stratified Split" option.  Next, we need to select a column (or set of columns).  Basically, Azure ML will break the data up into segments based on the unique values in the selected column (or combination of columns).  Then, it will perform a Simple Random Sample on each segment.  This can be extremely helpful if your data set has some extremely important factors that need to have similar distributions in the Training Set and the Testing Set.  For instance, if we were trying to classify whether a car would sell (Yes or No), then it would be a very good idea to take a Stratified Sample over the "Sold" column, as shown in the sketch below.  However, this technique should not be used when the Stratification column has a large number of unique values.  This is the case with our price column.  Therefore, Stratification is not appropriate here.
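Here's the sketch mentioned above: a minimal stratified split on a hypothetical "Sold" column using scikit-learn, so the Training and Testing Sets share a similar Yes/No distribution.  The data is made up for illustration.

```python
# Stratified split: sample within each level of the chosen column.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Price": [5000, 7000, 9000, 12000, 15000, 18000, 22000, 30000],
    "Sold":  ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "No"],
})
train, test = train_test_split(df, train_size=0.75, random_state=1234, stratify=df["Sold"])
print(train["Sold"].value_counts(normalize=True))   # Yes/No proportions preserved
print(test["Sold"].value_counts(normalize=True))
```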

As an interesting side note, we experimented with this technique and found that creating Stratified Samples over Continuous variables (like price) can have interesting results.  For instance, we built a dataset that contained a single column with the values 1 through 100 and no duplicates.  When we tried to pull a 50% Stratified Sample of this data, we found that Azure ML takes a sample at every possible value.  This means that it will try to take a 50% Simple Random Sample of a single row containing the value 5, for instance.  In every case, taking a sample of a single row will guarantee that the row gets returned, regardless of the sampling percentage.  Therefore, we ended up with all 100 rows in our left output and 0 rows in our right output, even though we wanted a 50/50 split.

Now that we've created our 70/30 split of the data, let's look at how it fits into the experiment.  We'll start by looking at the two different layouts, Train Model (which is used for Ordinary Least Squares Linear Regression) and Tune Model Hyperparameters (which is used for the other 3 regression algorithms).
Train Model

Tune Model Hyperparameters
We can see that the "Train Model" layout is pretty simple, the Training Set goes into the "Train Model" module and the Testing Set goes into the "Score Model" module.  However, in the "Tune Model Hyperparameters" layout, the Training Set and Testing Set both go in the "Tune Model Hyperparameters" module and the Testing Set also goes into the "Score Model" module.  Let's add all of these together and drop some "Evaluate Model" modules onto the end.
Model Evaluation
Unfortunately, we need two "Evaluate Model" modules because they can only accept two inputs.  Let's take a look at the visualization of the "Evaluate Model" module.
Ordinary Least Squares vs. Online Gradient Descent
Boosted Decision Tree vs Poisson


Some of you may recognize that these visualizations look completely different than those created by the "Evaluate Model" module we used in our previous posts about ROC, Precision, Recall and Lift.  This is because the model evaluation metrics for Regression are completely different than those for Classification.  Looking at the Metrics section, we see that the first four metrics are Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error and Relative Squared Error.  All of these measures tell us how far our predictions deviate from the actual values they are supposed to predict.  We don't generally pay much attention to these, but we do want to minimize them in practice.

The measure we are concerned with is Coefficient of Determination, also known as R Squared.  We've mentioned this metric a number of times during this blog series, but never really described what it tells us.  Basically, R Squared tells us how "good" our model is at predicting.  Higher values of R Squared are good and smaller values are bad.  It's difficult to determine what's an acceptable value.  Some people say .7 and other people say .8.  If we find anything lower than that, we might want to consider a different technique.  Fortunately for us, the Poisson Regression model has an R Squared of .97, which is extremely good.  As a side note, one of the reasons why we generally don't consider the values of the first four metrics is that they rarely disagree about which model is best.  If R Squared tells you that one model is the best, it's likely that all of the other metrics will tell you the same thing.  R Squared simply has the advantage of being bounded between 0 and 1, which means we can identify not only which model is best, but also whether that model is "good enough".
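For readers who want to compute these metrics by hand, here's a minimal scikit-learn sketch on a toy set of actual vs. predicted prices; the numbers are made up for illustration.

```python
# MAE, RMSE and R Squared for a small set of actual vs. predicted prices.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([13495, 16500, 6295, 10198, 23875])
predicted = np.array([14000, 15900, 6800, 9900, 22500])

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
r2 = r2_score(actual, predicted)                  # Coefficient of Determination
print(f"MAE = {mae:.0f}, RMSE = {rmse:.0f}, R Squared = {r2:.3f}")
```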

To finalize this evaluation, it seems that the Poisson Regression algorithm is the best model for this data.  This is especially interesting because, in the previous post, we commented on the fact that the Poisson Distribution is not theoretically appropriate for this data.  Given this, it may be a good idea to find an additional set of validation data to confirm our results.  Alas, that's beyond the scope of this experiment.

Hopefully, this discussion has opened your horizons to the possibilities of using Regression to answer some more complex business problems.  Stay tuned for the next post where we'll be talking about Normalizing Features for Regression.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, May 1, 2017

Azure Machine Learning: Regression Using Poisson Regression

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression, Online Gradient Descent Linear Regression and Boosted Decision Tree Regression phases of the experiment.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.  Specifically, we're going to be walking through the Poisson Regression algorithm.  Let's start by talking about Poisson Regression.

Poisson Regression is used to predict values that have a Poisson Distribution, i.e. counts within a given timeframe.  For example, the number of customers that enter a store on a given day may follow a Poisson Distribution.  Given that these values are counts, there are a couple of caveats.  First, the counts cannot be negative. Second, the counts could theoretically extend to infinity.  Finally, the counts must be Whole Numbers.
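For reference, a count $Y$ that follows a Poisson Distribution with rate $\lambda$ has probability mass function

$$P(Y = k) \;=\; \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$

which is only defined for non-negative whole numbers, matching the caveats above.  In Poisson Regression, the rate is modelled as an exponential function of the features, $\lambda = e^{x^{T}\beta}$.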

Just by looking at these three criteria, it may seem like Poisson Regression is theoretically appropriate for this data set.  However, the issue comes when we consider the mathematical underpinning of the Poisson Distribution.  Basically, the Poisson Distribution assumes that each entity being counted operates independently of the other entities.  Back to our earlier example, we assume that each customer entering the store on a given day does so without considering whether the other customers will be going to the store on that day as well.  Comparing this to our vehicle price data, that would be akin to saying that when a car is bought, each dollar independently decides whether it wants to jump out of the buyer's pocket and into the seller's hand.  Obviously, this is a ludicrous notion.  However, we're not theoretical purists and love bending rules (as long as they produce good results).  For us, the true test comes from the validation portion of the experiment, which we'll cover in a later post.  If you want to learn more about Poisson Regression, read this and this.  Let's take a look at the parameters for this module.
Poisson Regression
The Poisson Regression algorithm uses an optimization technique known as Limited Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS).  Basically, this technique tries to find the "best" set of parameters to fill in our Poisson Regression equation, which is described in detail here.  In practice, the smaller we make "Optimization Tolerance", the longer the algorithm will take to train and the more accurate the results should be.  This value can be optimized using the "Tune Model Hyperparameters" module.

Without going into too much depth, the "L1 Regularization Weight" and "L2 Regularization Weight" parameters penalize complex models.  If you want to learn more about Regularization, read this and this.  As with "Optimization Tolerance", Azure ML can choose these values for us.
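For readers working outside of Azure ML, scikit-learn's PoissonRegressor is a loosely analogous setup: it is also fit with L-BFGS, exposes an L2 penalty through alpha and a convergence tolerance through tol (there is no separate L1 weight).  The sketch below uses synthetic counts and is illustrative only, not the Azure ML module.

```python
# Poisson regression on synthetic count data.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.poisson(lam=np.exp(0.5 + X @ np.array([0.3, -0.2, 0.1])))  # counts

model = PoissonRegressor(alpha=1.0, tol=1e-7, max_iter=1000)
model.fit(X, y)
print(model.coef_.round(3), round(float(model.intercept_), 3))
```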

"Memory Size for L-BFGS" specifies the amount of memory allocated to the L-BFGS algorithm.  We can't find much more information about what effect changing this value will have.  Through some testing, we did find that this value had very little impact on our model, regardless of how large or small we made it (the minimum value we could provide is 1).  However, if our data set had an extremely large number of columns, we may find that this parameter becomes more significant.  Once again, we do not have to choose this value ourselves.

The "Random Number Seed" parameter allows us to create reproducible results for presentation/demonstration purposes.  Oddly enough, we'd expect this value to play a role in the L-BFGS algorithm, but it doesn't seem to.  We were unable to find any impact caused by changing this value.

Finally, we can choose to deselect "Allow Unknown Categorical Levels".  When we train our model, we do so using a specific data set known as the training set.  This allows the model to predict based on values it has seen before.  For instance, our model has seen "Num of Doors" values of "two" and "four".  So, what happens if we try to use the model to predict the price for a vehicle with a "Num of Doors" value of "three" or "five"?  If we leave this option selected, then this new vehicle will have its "Num of Doors" value thrown into an "Unknown" category.  This would mean that if we had a vehicle with three doors and another vehicle with five doors, they would both be thrown into the same "Num of Doors" category.  To see exactly how this works, check out our previous post, Regression Using Linear Regression (Ordinary Least Squares).
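As a loose scikit-learn analogy (not the Azure ML behaviour), OneHotEncoder with handle_unknown="ignore" keeps scoring from failing when an unseen category shows up at prediction time; the unseen level is encoded as all zeros rather than an explicit "Unknown" bucket.

```python
# Handling a "Num of Doors" value that was never seen during training.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["two"], ["four"], ["four"]])            # training set only saw two/four
print(enc.transform([["two"], ["three"]]).toarray())
# [[0. 1.]   -> "two"
#  [0. 0.]]  -> unseen "three" maps to all zeros
```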

The options for the "Create Trainer Mode" parameter are "Single Parameter" and "Parameter Range".  When we choose "Parameter Range", we instead supply a list of values for each parameter and the algorithm will build multiple models based on the lists.  These multiple models must then be whittled down to a single model using the "Tune Model Hyperparameters" module.  This can be really useful if we have a list of candidate models and want to be able to compare them quickly.  However, we don't have a list of candidate models, but that actually makes "Tune Model Hyperparameters" more useful.  We have no idea what the best set of parameters would be for this data.  So, let's use it to choose our parameters for us.
Tune Model Hyperparameters
Tune Model Hyperparameters (Visualization)
We can see that there is very little difference between the top models using Coefficient of Determination, also known as R Squared.  This is a great sign because it means that our model is very robust and we don't have to sweat over choosing the perfect parameters.

On a side note, there is a display issue causing some values for the "Optimization Tolerance" parameter to display as 0 instead of whatever extremely small value they actually are.  This is disappointing as it limits our ability to manually type these values into the "Poisson Regression" module.  One of the outputs from the "Tune Model Hyperparameters" module is the Trained Best Model, i.e. whichever model appears at the top of the list based on the metric we chose.  This means that we can use it as an input into other modules like "Score Model".  However, it does mean that we cannot use these parameters in conjunction with the "Cross Validate Model" module, as that requires an Untrained Model as an input.  Alas, this is not a huge deal because we see that "Optimization Tolerance" does not have a very large effect on the resulting model.
All Regression Models Complete
Hopefully we've laid the groundwork for you to understand Poisson Regression and utilize it in your work.  Stay tuned for the next post where we'll be talking about Regression Model Evaluation.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com