Monday, August 14, 2017

Azure Machine Learning in Practice: Data Cleansing

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous post, it's highly recommended that you do, as it provides valuable context.  In this post, we're going to walk through the data cleansing process.

Data Cleansing is arguably one of the most important phases in the Machine Learning process.  There's an old programming adage: "Garbage In, Garbage Out".  This applies even more strongly to Machine Learning.  The purpose of data cleansing is to ensure that the data we are using is "suitable" for the analysis we are doing.  "Suitable" is an amorphous term that takes on drastically different meanings based on the situation.  In our case, we are trying to accurately identify when a particular credit card purchase is fraudulent.  So, let's start by looking at our data again.
Credit Card Fraud Data 1

Credit Card Fraud Data 2
We can see that our data set is made up of a "Row Number" column, 30 numeric columns and a "Class" column.  For more information about what these columns mean and how they were created, read our previous post.  In our experiment, we want to create a model to predict when a particular transaction is fraudulent.  This is the same as predicting when the "Class" column equals 1.  Let's take a look at the "Class" column.
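If you'd like to poke at the data outside of Azure ML Studio, here's a minimal pandas sketch for confirming the column layout.  The file name "creditcard.csv" is an assumption; substitute whatever your copy of the data set is called.

```python
import pandas as pd

# Load the credit card fraud data set.  The file name "creditcard.csv" is an
# assumption; substitute the name of your own copy of the data.
df = pd.read_csv("creditcard.csv")

# We expect a row identifier, 30 numeric feature columns and a binary "Class" column.
print(df.shape)
print(df.dtypes.value_counts())
print(df["Class"].value_counts())
```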
Class Statistics

Class Histogram
Looking at the histogram, we can see that we have heavily skewed data.  A simple math trick tells us that we can determine the percentage of "1" values simply by multiplying the mean by 100.  Therefore, we can see that 0.13% of our records are fraudulent.  This is what's known as an "imbalanced class".  An imbalanced class problem is especially tricky because we have to use a different set of evaluation metrics.  For instance, if we were to always guess that every record is not fraudulent, we would be correct 99.87% of the time.  While these seem like amazing odds, they are completely worthless for our analysis.  If you want to learn more, a quick Google search brought up this interesting article that may be worth a read.  We'll touch on this more in a later post.  For now, let's keep this in the back of our minds and move on to summarizing our data.
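For readers who prefer to see the arithmetic directly, here's a quick Python sketch using made-up illustrative data (not the real data set) showing why the mean of a 0/1 column gives the fraud percentage, and why raw accuracy is worthless here.

```python
import numpy as np

# Illustrative data only: a 0/1 class column where roughly 0.13% of records are fraudulent.
rng = np.random.default_rng(42)
cls = (rng.random(1_000_000) < 0.0013).astype(int)

# For a 0/1 column, the mean is the fraction of 1s, so mean * 100 is the percentage of fraud.
print(f"Fraud rate: {cls.mean() * 100:.2f}%")

# A "model" that always guesses "not fraudulent" is still right almost every time,
# which is why plain accuracy is worthless on an imbalanced class.
always_zero = np.zeros_like(cls)
print(f"Accuracy of always guessing 0: {(always_zero == cls).mean() * 100:.2f}%")
```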
Credit Card Fraud Summary 1

Credit Card Fraud Summary 2

Credit Card Fraud Summary 3
A few things stick out when we look at this.  First, all of the features except "Class" have missing values.  We need to take care of this.  Second, the "Class" feature doesn't have missing values.  This is great!  Given that our goal is to predict fraud, it would be pretty pointless if some of our records didn't have a known value for "Class".  Finally, it's important to note that all of our variables are numeric.  Most machine learning algorithms cannot accept string values as input.  However, most of the Azure Machine Learning algorithms will transform any string features into numeric features.  You can find out more about Indicator Variables in an earlier post.  Now, let's look at some of the ways to deal with our missing values.  Cue the "Clean Missing Data" module.
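Here's a quick pandas sketch for verifying that picture yourself: count the missing values per column and confirm that "Class" is complete.  Again, the file name is an assumption.

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")  # file name is an assumption

# Count missing values per column.  We expect gaps in the feature columns,
# but none in the "Class" label we are trying to predict.
missing = df.isnull().sum()
print(missing[missing > 0])
print("Missing values in Class:", df["Class"].isnull().sum())
```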
Clean Missing Data
The task of cleaning missing data is known as Imputation.  Given its importance, we've touched on it a couple of times on this blog (here and here).  The goal of imputation is to create a data set that gives us the "most accurate" answer possible.  That's a very vague concept.  However, we have a big advantage in that we have a data set with known "Class" values to test against.  Therefore, we can try a few different options to see which ones work best with our data and our models.

In the previous posts, we've focused on "Custom Substitution Value" just to save time.  However, our goal in this experiment is to create the most accurate model possible.  Given that goal, it would seem like a waste not to use some of the more powerful tools in our toolbox.  We could use some of the simpler algorithms like Mean, Median or Mode.  However, we have a large number of dense features (this is a result of the Principal Component Analysis we talked about in the previous post).  This means that we have a perfect use case for the heavy-hitters in the toolbox, MICE (Multivariate Imputation by Chained Equations) and Probabilistic PCA (PPCA).  Whereas the Mean, Median and Mode algorithms determine a replacement value using a single column, the MICE and PPCA algorithms utilize the entire dataset.  This makes them extremely powerful at providing very accurate replacements for missing values.
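In Azure ML Studio all of this happens inside the "Clean Missing Data" module, but if you want to experiment with a MICE-style approach in code, scikit-learn's IterativeImputer follows a similar chained-equations idea.  This is only a sketch under that assumption, not the module's actual implementation, and the file name is again an assumption.

```python
import pandas as pd
# IterativeImputer is still marked experimental in scikit-learn,
# so this enabling import is required before it can be used.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("creditcard.csv")  # file name is an assumption
features = df.drop(columns=["Class"])

# MICE-style imputation: each column with missing values is modeled as a
# function of the other columns, and the process iterates until it stabilizes.
mice = IterativeImputer(max_iter=10, random_state=0)
features_imputed = pd.DataFrame(mice.fit_transform(features), columns=features.columns)

print("Remaining missing values:", features_imputed.isnull().sum().sum())  # should be 0
```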

So, which should we choose?  This is one of the many crossroads we will run across in this experiment, and the answer is always the same.  Let the data decide!  There's nothing stopping us from creating two streams in our experiment, one which uses MICE and one which uses PPCA.  If we were so inclined, we could create additional streams for the other substitution algorithms or a stream for no substitution at all.  Alas, that would greatly increase the development effort without likely paying off in the end.  For now, we'll stick with MICE and PPCA.  Which one's better?  We won't know that until later in the experiment.
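As a rough illustration of "letting the data decide" outside of Studio, here's a hedged sketch that scores two imputation streams with cross-validation.  Scikit-learn doesn't ship a PPCA imputer, so plain mean substitution stands in for the second branch, and the logistic regression model and file name are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("creditcard.csv")  # file name is an assumption
features, labels = df.drop(columns=["Class"]), df["Class"]

# Two candidate "streams": MICE-style imputation and plain mean substitution,
# each feeding the same simple model.  AUC is used instead of raw accuracy
# because of the imbalanced class.
streams = {
    "MICE-style": make_pipeline(IterativeImputer(max_iter=10, random_state=0),
                                LogisticRegression(max_iter=1000)),
    "Mean": make_pipeline(SimpleImputer(strategy="mean"),
                          LogisticRegression(max_iter=1000)),
}

for name, pipeline in streams.items():
    scores = cross_val_score(pipeline, features, labels, cv=3, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```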

Hopefully, this post introduced you to some of the ways that you can use Imputation and Data Cleansing to provide additional power to your models.  There was far more we could have done here.  In fact, many data scientists approaching hard problems will spend most of their time adding new variables and transforming existing ones to create even more powerful models.  In our case, we don't like putting in the extra work until we know it's necessary.  Stay tuned for the next post where we'll talk about model selection.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com