Monday, November 6, 2017

Azure Machine Learning in Practice: Feature Selection

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation, Data Cleansing, Model Selection, Model Evaluation and Threshold Selection phases of the experiment.  In this post, we're going to walk through the feature selection process.

In the traditional data science space, feature selection is generally one of the first phases of a modelling process.  A large reason for this is that, historically, building models using a hundred features would take a long time.  Also, an individual would have to sort through all of the features after modeling to determine what impact the features were having.  Sometimes, they would even find that some features would make the model less accurate by including them.  There's also the concept of parsimony to consider.  Basically, less variables was generally considered better.

However, technology and modelling techniques have come a long way over the last few decades.  We would be doing a great disservice to modern Machine Learning to say that it resembles traditional statistics in a major way.  Therefore, we try to approach feature selection from a more practical perspective.

First, we found that we were able to train over 1000 models in about an hour and a half.  Therefore, removing features for performance reasons is not necessary.  However, in other cases, it may be.  In those cases, paring down features initially could be beneficial.

Now, we need to determine which variables are having no (or even negative) impact on the resulting model.  If they aren't helping, then we should remove them.  To do this, we can use a technique known as Permutation Feature Importance.  Azure Machine Learning even has a built-in module for this.  Let's take a look.
Feature Selection Experiment
Permutation Feature Importance
This module requires us to input our trained model, as well as a testing dataset.  We also have to decide which metric we would like to use.  With that, it will output a dataset showing us the impact that each feature has on the model.

So, how does Permutation Feature Importance work?  Honestly, it's one of more clever algorithms we've come across.  The module chooses one feature at a time, randomly shuffles the values for that feature across the different rows, then retrains the model.  Then, it can evaluate the impact of that feature by seeing how much the trained model changed when the values were shuffled.  A very important feature would obviously cause large changes in the model if they were shuffled.  A less important feature would have less impact.  In our case, we want to measure impact by using Precision and Recall.  Unfortunately, the module only gives us the option to use one at a time.  Therefore, we'll have to be more creative.  Let's start by looking at the output of the Precision module.
Feature Importance (Precision) 1
Feature Importance (Precision) 2
We can see that there a few features that are important.  The rest of the features have no impact on the model.  Let's look at the output from the Recall module.
Feature Importance (Recall) 1
Feature Importance (Recall) 2
Now, let's compare the results of the two modules.
Feature Importance
We can see that the two modules have almost the same output, except for V4, which is only important for Precision.  This means that we should be able to remove all of the other features without affecting the model.  Let's try it and see what happens.
Feature Reduction Experiment
Tune Model Hyperparameters Results
R Script Results
We can see that removing those features from the model did reduce the Precision and Recall.  There was not a dramatic reduction in these values, but there was a reduction nonetheless.  This is likely caused by rounding error.  What was originally a very small decimal value assigned to the importance of each feature was rounded to 0, causing us to think that there was no importance.  Therefore, we are at a decision point.  Do we remove the features knowing that they slightly hurt the model?  Since we're having no performance issues and model understandability is not a factor, we would say no.  It's better to keep the original model than it is to make a slimmer version.

It is important to note that in practice, we've never seen the "Permutation Feature Importance" model throw out the majority of the features.  Usually, there are a few features that have a negative impact.  As we slowly remove them one at a time, we eventually find that most of the features have a positive impact on the model.  While we won't get into the math behind the scenes, we will say that we highly suspect this unusual case was caused by the fact that we are only given a subset of the Principal Components created using Principal Components Analysis.

Hopefully, this post enlightened you to some of the thought process behind Feature Selection in Azure Machine Learning.  Permutation Feature Importance is a fast, simple way to improve the accuracy and performance of your model.  Stay tuned for the next post where we'll be talking about Feature Engineering.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com