Monday, September 25, 2017

Azure Machine Learning in Practice: Model Evaluation

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation, Data Cleansing and Model Selection phases of the experiment.  In this post, we're going to walk through the model evaluation process.

Model evaluation is "where the rubber meets the road", as they say.  Up until now, we've been building a large list of candidate models.  This is where we finally choose the one that we will use.  Let's take a look at our experiment so far.
Experiment So Far
We can see that we have selected two candidate imputation techniques and fourteen candidate models.  However, the numbers are about to get much larger.  Let's take a look at the workhorse of our experiment, "Tune Model Hyperparameters".
Tune Model Hyperparameters
We've looked at this module in some of our previous posts (here and here).  Basically, this module works by allowing us to define (or randomly choose) sets of hyperparameters for our models.  For instance, if we run the "Two-Class Boosted Decision Tree" model through this module with our training and testing data, we get an output that looks like this.
Sweep Results (Two-Class Boosted Decision Tree)
The result of the "Tune Model Hyperparameters" module is a list of hyperparameter sets for the input model.  In this case, it is a list of hyperparameters for the "Two-Class Boosted Decision Tree" model, along with various evaluation metrics.  Using this module, we can easily test tens, hundreds or even thousands of different sets of hyperparameters in order to find the absolute best set of hyperparameters for our data.
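For readers who prefer to see the idea in code, here's a minimal sketch of a similar random hyperparameter sweep using scikit-learn's RandomizedSearchCV.  This is not the Azure ML "Tune Model Hyperparameters" module itself, just an illustration of the concept; the file name, column names and parameter grid below are hypothetical.

```python
# A minimal sketch of a random hyperparameter sweep, conceptually similar to
# "Tune Model Hyperparameters" (not the Azure ML module itself).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

df = pd.read_csv("creditcard.csv")                   # hypothetical path to the fraud dataset
X, y = df.drop(columns=["Class"]), df["Class"]

# Candidate hyperparameter values to sample from (illustrative only).
param_distributions = {
    "n_estimators": [20, 100, 500],
    "learning_rate": [0.01, 0.1, 0.2, 0.4],
    "max_depth": [2, 4, 8, 16],
    "min_samples_leaf": [1, 10, 50],
}

# n_iter controls how many random hyperparameter sets are tried, analogous to
# the number of sweep runs in "Tune Model Hyperparameters".
sweep = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions,
    n_iter=20,
    scoring="precision",
    cv=3,
    random_state=42,
)
sweep.fit(X, y)

# One row per hyperparameter set, with its evaluation metric.
print(pd.DataFrame(sweep.cv_results_)[["params", "mean_test_score"]])
```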

Now, we have a way to choose the best possible model.  The next step is to choose which evaluation metric we will use to rank all of these candidate models.  The "Tune Model Hyperparameters" module has a few options.
Evaluation Metrics
This is where a little bit of mathematical background can help tremendously.  Without going into too much detail, there's a problem with using some of these metrics on our dataset.  Let's look back at our "Class" variable. 
Class Statistics
Class Histogram
We see that the "Class" variable is extremely skewed, with 99.87% of all observations having a value of 0.  Therefore, traditional metrics such as Accuracy and AUC are not acceptable.  To further understand this, imagine if we built a model that always predicted 0.  That model would have an accuracy of 99.87%, despite being completely useless for our use case.  If you want to learn more, you can check out this whitepaper.  Now, we need to utilize a new set of metrics.  Let's talk about Precision and Recall.

Precision is the percentage of predicted "positive" records (Class = 1 -> "Fraud" in our case) that are correct.  Notice that we said PREDICTED.  Precision looks at the set of records where the model thinks Fraud has occurred.  This metric is calculated as

(Number of Correct Positive Predictions) / (Number of Positive Predictions)

One of the huge advantages of Precision is that it doesn't care how "rare" the positive case is.  This is extremely beneficial in our case because fraud occurs in only 0.13% of our observations, which is extremely rare.  Naturally, we want Precision to be as close to 1 (or 100%) as possible.

On the other hand, Recall is the percentage of actual "positive" records that the model correctly identifies.  This is slightly different from Precision in that it looks at the set of records where Fraud has actually occurred.  This metric is calculated as

(Number of Correct Positive Predictions) / (Number of Actual Positive Records)

Just as with Precision, Recall doesn't care how rare the positive case is.  Also like Precision, we want this value to be as close to 1 as possible.
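Both definitions translate directly into a few lines of code.  Here's a minimal sketch that computes Precision and Recall from small, made-up prediction vectors, purely for illustration.

```python
# Precision and Recall computed directly from the definitions above,
# using small hypothetical prediction vectors.
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # actual Class values
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])  # model predictions

true_positives      = np.sum((y_pred == 1) & (y_true == 1))
predicted_positives = np.sum(y_pred == 1)
actual_positives    = np.sum(y_true == 1)

precision = true_positives / predicted_positives  # correct positive predictions / positive predictions
recall    = true_positives / actual_positives     # correct positive predictions / actual positive records

print(f"Precision: {precision:.2f}")  # 3 / 4 = 0.75
print(f"Recall:    {recall:.2f}")     # 3 / 4 = 0.75
```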

In our minds, Precision is a measure of how accurate your fraud predictions are, while Recall is a measure of how much fraud the model is catching.  Let's look back at our evaluation metrics for the "Tune Model Hyperparameters" module.
Evaluation Metrics
We can see that Precision and Recall are both in this list.  So, which one do we choose?  Honestly, we don't have a definitive answer.  Instead, we'll go back to our favorite method: try them both!
Model Evaluation Experiment
This is where Azure Machine Learning really provides value.  In about thirty minutes, we were able to set up an experiment that evaluates fourteen algorithms, each with twenty hyperparameter sets, across two cleansing techniques and two evaluation metrics.  That's 14 * 20 * 2 * 2 = 1,120 models!  After this finishes, we copy all of these results out to an Excel spreadsheet so we can take a look at them.
MICE - Precision - Averaged Perceptron Results
Our Excel document is simply a series of tables very similar to this one.  They show the parameters used for the model, as well as the evaluation statistics for that model.  Using this, we could easily find the combination of model, parameters and cleansing technique that gives us the highest Precision or Recall.  However, this still requires us to choose one or the other.  Looking back at the definitions of these metrics, they cover two different, important cases.  What if we want to maximize both?  Since we have the data in Excel, we can easily add a column for Precision * Recall and find the model that maximizes that value.
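For those who would rather not do this step in Excel, the same ranking can be sketched in a few lines of pandas.  The file name and column names below are hypothetical stand-ins for the exported sweep results.

```python
# Rank the exported sweep results by Precision * Recall (column names assumed).
import pandas as pd

results = pd.read_csv("sweep_results.csv")           # hypothetical export of the sweep results

# Add the combined metric and surface the best-performing row.
results["Precision x Recall"] = results["Precision"] * results["Recall"]
best = results.sort_values("Precision x Recall", ascending=False).iloc[0]

print(best[["Model", "Cleansing", "Precision", "Recall", "Precision x Recall"]])
```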
PPCA - Recall - LD SVM - Binning Results
As we can see from this table, the best model for this dataset is to clean the data using Probabilistic Principal Component Analysis, then model the data using a Locally-Deep Support Vector Machine with a Depth of 4, Lambda W of .065906, Lambda Theta Prime of .003308, Sigma of .106313 and 14,389 Iterations.  A very important consideration here is that we will not get the same results by copy-pasting these parameter values into the "Locally-Deep Support Vector Machine" module, because the displayed values are rounded.  Instead, we should save the best trained model directly to our Azure ML workspace.
Save Trained Model
At this point, we could easily consider this problem solved.  We have created a model that catches 90.2% of all fraud with a precision of 93.0%.  A very important point to note about this whole exercise is that we did not use domain knowledge, assumptions or "Rules of Thumb" to drive our model selection process.  Our model was selected entirely by using the data.  However, there are a few more steps we can perform to squeeze more power and performance out of our model.  Hopefully, this has opened your eyes to the Model Evaluation power of Azure Machine Learning.  Stay tuned for the next post where we'll discuss Threshold Selection.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, September 4, 2017

Azure Machine Learning in Practice: Model Selection

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation and Data Cleansing phases of the experiment.  In this post, we're going to walk through the model selection process.

In traditional machine learning and data science applications, model selection is a time-consuming process that generally requires a significant amount of statistical background.  Azure Machine Learning completely breaks this paradigm.  As you will see in the next few posts, model selection in Azure Machine Learning requires nothing more than a basic understanding of the problem we are trying to solve and a willingness to let the data pick our model for us.  Let's take a look at our experiment so far.
Experiment So Far
We can see that we've already imported our data and decided to use two different imputation methods, MICE and Probabilistic PCA.  Now, we need to select which models we would like to use to solve our problem.  It's important to remember that our goal is to predict when a transaction is fraudulent, i.e. has a "Class" value of 1.  Before we do that, we should remember to remove the "Row Number" feature from our dataset, as it has no analytical value.
Select Columns in Dataset
Now, let's take a look at our model options.
Initialize Model
Using the toolbox on the left side of the Azure Machine Learning Studio, we can work our way down to the "Initialize Model" section.  Here, we have four different types of models, "Anomaly Detection", "Classification", "Clustering" and "Regression".

"Anomaly Detection" is the area of Machine Learning where we try to find things that look "abnormal".  This is an especially difficult task because it requires defining what's "normal".  Fortunately, Azure ML has some great tools that handle the hard work for us.  These types of models are very useful for Fraud Detection in areas like Credit Card and Online Retail transactions, as well Fault Detection in Manufacturing.  However, our training data already has fraudulent transactions labelled.  Therefore, Anomaly Detection may not be what we're looking for.  However, one of the great things about Data Science is that there are no right answers.  Feel free to add some Anomaly Detection algorithms to the mix if you would like.

"Classification" is the area of Machine Learning where we try to determine which class a record belongs to.  For instance, we can look at information about a person and attempt to determine where they are likely to buy a particular product.  This technique requires that we have an initial set of data where already know the classes.  This is the most commonly used type of algorithm and can be found in almost every subject area.  It's not coincidence that our variable of interest in this experiment is called "Class".  Since we already know whether each of these transactions was fraudulent or not, this is a prime candidate for a "Classification" algorithm.

"Clustering" is the area of Machine Learning where we try to group records together to identify which records are "similar".  This is a unique technique belonging to a category of algorithms known as "Unsupervised Learning" techniques.  They are unsupervised in the sense that we are not telling them what to look for.  Instead, we're simply unleashing the algorithm on a data set to see what patterns it can find.  This is extremely useful in Marketing where being able to identify "similar" people is important.  However, it's not very useful for our situation.

"Regression" is the area of Machine Learning where try to predict a numeric value by using other attributes related to it.  For instance, we can use "Regression" techniques to use information about a person to predict their salary.  "Regression" has quite a bit in common with "Classification".  In fact, there are quite a few algorithms that have variants for both "Classification" and "Regression".  However, our experiment only wants to predict a binary (1/0) variable.  Therefore, it would be inappropriate to use a "Regression" algorithm.

Now that we've decided "Classification" is the category we are looking for, let's see what algorithms are underneath it.
Classification
For the most part, we can see that there are two types of algorithms, "Two-Class" and "Multiclass".  Since the variable we are trying to predict ("Class") only has two values, we should use the "Two-Class" algorithms.  But which one?  This is the point where Azure Machine Learning really stands out from the pack.  Instead of choosing one, or even a few, algorithms, we can try them all.  In total, there are nine different "Two-Class Classification" algorithms.  However, in the next post, we'll be looking at the "Tune Model Hyperparameters" module.  Using that module, we'll find that there are actually fourteen distinct algorithm variants, as some of the algorithms have a few different resampling and normalization options, and one of the algorithms doesn't work with "Tune Model Hyperparameters".  Here's the complete view of all the algorithms.
Two-Class Classification Algorithms
For those that may have issues seeing the image, here's a list.

Two-Class Averaged Perceptron
Two-Class Boosted Decision Tree
Two-Class Decision Forest - Resampling: Replicate
Two-Class Decision Forest - Resampling: Bagging
Two-Class Decision Jungle - Resampling: Replicate
Two-Class Decision Jungle - Resampling: Bagging
Two-Class Locally-Deep Support Vector Machine - Normalizer: Binning
Two-Class Locally-Deep Support Vector Machine - Normalizer: Gaussian
Two-Class Locally-Deep Support Vector Machine - Normalizer: Min-Max
Two-Class Logistic Regression
Two-Class Neural Network - Normalizer: Binning
Two-Class Neural Network - Normalizer: Gaussian
Two-Class Neural Network - Normalizer: Min-Max
Two-Class Support Vector Machine

A keen observer may notice that the "Two-Class Bayes Point Machine" model was not included in this list.  For some reason, this model cannot be used in conjunction with "Tune Model Hyperparameters".  However, we will handle this in a later post.
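For readers who want to experiment with the same "try them all" idea outside of Azure ML Studio, here's a rough sketch that loops over a handful of scikit-learn models as approximate stand-ins for some of the algorithms above.  These are not the Azure ML modules themselves, and the file path is hypothetical.

```python
# A rough "try them all" loop using scikit-learn stand-ins for a few of the
# two-class algorithms listed above (not the actual Azure ML modules).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

df = pd.read_csv("creditcard.csv")                   # hypothetical path to the fraud dataset
X = df.drop(columns=["Class", "Row Number"], errors="ignore")
y = df["Class"]

candidates = {
    "Averaged Perceptron (approx.)": Perceptron(),
    "Boosted Decision Tree (approx.)": GradientBoostingClassifier(),
    "Decision Forest (approx.)": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine (approx.)": LinearSVC(),
}

# Score every candidate with the same cross-validated metric.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=3, scoring="precision")
    print(f"{name}: mean precision = {scores.mean():.3f}")
```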

Hopefully, this post helped shed some light on "WHY" you would choose certain models over others.  We can't stress enough that the path to success is to let the data decide which model is best, not "rules-of-thumb" or theoretical guidelines.  Stay tuned for the next post, where we'll use large-scale model evaluation to pick the best possible model for our problem.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com