Monday, July 24, 2017

Azure Machine Learning in Practice: Fraud Detection

So far, we've been walking through the various algorithms and tools available for solving different types of problems.  However, we haven't yet delved into how a data scientist would solve a real-world problem.  This next series is going to focus on a real data set from www.kaggle.com.  For those who are unfamiliar with Kaggle, it's a website that hosts data science competitions, allowing users from all over the world to use whatever tools and algorithms they like to solve a problem.  This data set focuses on credit card fraud.  Specifically, the goal is to use a large set of anonymized data to create a fraud detection algorithm.  You can find out more about the data set here.

Some of you may be thinking "I thought this was going to be a real problem, not a fake one!"  It turns out that we solved this Kaggle problem in almost exactly the same way that we've solved real customers' problems at work.  The only difference here is that this data has been anonymized in order to protect everyone's privacy.

For this post, let's take a look at the data set.
Credit Card Fraud Data 1

Credit Card Fraud Data 2
We can see that this data set has the following features: "Row Number", "Time", "V1"-"V28", "Amount" and "Class".  The "Row Number" feature is simply a row identifier and should not be included in any of the models or analysis.  The "Time" feature represents the number of seconds between the current transaction and the first transaction in the data set.  This information could be very useful because transactions that occur very rapidly or at constant increments could be an indicator of fraud.  The "Amount" feature is the value of the transaction.  The "Class" feature is our fraud indicator: a value of 1 means the transaction was fraudulent, and a value of 0 means it was not.
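If you'd like to poke at the data in code before bringing it into Azure ML, here's a minimal pandas sketch that loads the file and checks how imbalanced the "Class" column is (the file name "creditcard.csv" is an assumption; use whatever name you saved the download under):

```python
import pandas as pd

# Assumes the Kaggle file has been downloaded locally as "creditcard.csv".
df = pd.read_csv("creditcard.csv")

# Peek at the shape and the columns described above: Time, V1-V28, Amount, Class.
print(df.shape)
print(df.columns.tolist())

# Fraud data is typically very imbalanced; "Class" == 1 marks a fraudulent transaction.
print(df["Class"].value_counts(normalize=True))
```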

Finally, let's talk about the "V1"-"V28" columns.  These columns represent all of the other data we have about these customers and transactions, combined into 28 numeric features.  Obviously, there were far more than 28 original features.  However, in order to anonymize the data and reduce the number of features, the creators of the data set used a technique known as Principal Component Analysis (PCA).  This is a well-known mathematical technique for compressing a large number of original columns into a small number of dense numeric columns.  Conveniently for the creators of this data set, it also has the advantage of anonymizing any data you use it on.  While we won't dig into PCA in this post, there is an Azure Machine Learning module called Principal Component Analysis that will perform this technique for you.  We may cover this module in a later post.  Until then, you can read more about it here.
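We won't dig into the math here, but as a rough illustration of the idea, here's a hedged scikit-learn sketch that compresses a wide numeric table down to 28 dense components.  The data and component count are purely illustrative; this is not how the Kaggle data set was actually produced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative only: 1,000 fake transactions with 100 raw numeric features.
rng = np.random.default_rng(42)
raw_features = rng.normal(size=(1000, 100))

# PCA is sensitive to scale, so standardize the columns first.
scaled = StandardScaler().fit_transform(raw_features)

# Compress the 100 raw columns down to 28 dense components,
# similar in spirit to the "V1"-"V28" columns in the Kaggle data.
pca = PCA(n_components=28)
components = pca.fit_transform(scaled)

print(components.shape)                     # (1000, 28)
print(pca.explained_variance_ratio_.sum())  # variance retained by the 28 components
```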

Summarize Data
Another interesting aspect to note is that this data set contains around 200,000 rows and has a significant number of missing values.  These missing values were not part of the original data set provided by Kaggle.  We use this data set as an example for some of our training sessions (training people, not training models).  Therefore, we wanted to add some additional speed bumps to the data in order to enhance the value of the training.  So, instead of using the single large data set provided by Kaggle, we provide a training set, which has missing values, and a testing set, which does not.  If you would like to use these data sets instead, you can find them here.
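If you grab the training set, a quick way to see where the missing values live is a sketch like the one below (the file name "fraud_training.csv" is an assumption; substitute whatever name you save the file under):

```python
import pandas as pd

# Assumed local file name for the training set that contains missing values.
train = pd.read_csv("fraud_training.csv")

# Count missing values per column, showing only the affected columns.
missing = train.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))

# Overall fraction of rows containing at least one missing value.
print(train.isnull().any(axis=1).mean())
```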

Hopefully we've piqued your interest in Fraud Detection in Azure Machine Learning.  Feel free to hop right into the analysis and see what you can do on your own.  Maybe you'll create a better model than ours!  Stay tuned for the next post, where we'll talk about cleaning up this data and preparing it for modelling.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, July 3, 2017

Azure Machine Learning: Cross-Validation for Regression

Today, we're going to continue our walkthrough of Sample 4: Cross Validation for Regression: Auto Imports Dataset.  In the previous posts, we walked through the Initial Data Load, Imputation, Ordinary Least Squares Linear Regression, Online Gradient Descent Linear Regression, Boosted Decision Tree Regression, Poisson Regression and Model Evaluation phases of the experiment.  We also discussed Normalization.
Experiment So Far
Let's refresh our memory on the data set.
Automobile Price Data (Clean) 1

Automobile Price Data (Clean) 2
We can see that this data set contains a bunch of text and numeric data about each vehicle, as well as its price.  The goal of this experiment is to attempt to predict the price of the car based on these factors.

To finish this experiment, let's take a look at Cross-Validation.  We briefly touched on this topic in a previous post as it relates to Classification models.  Let's dig a little deeper into it.

As we've mentioned before, it is extremely important to have a separation between Training data (data used to train the model) and Testing data (data used to evaluate the model).  This separation is necessary because it allows us to determine how well our model can predict "new" data, meaning data that the model has not seen before.  If we were to train the model using a data set, then evaluate the model using that same data set, we would have no way to determine whether the model is good at fitting "new" data or only good at fitting data it has already seen.
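For those who like to see the concept in code, here's a minimal scikit-learn sketch of the same idea on synthetic data (the data, model and 70/30 split are purely illustrative, not the experiment itself):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data: 500 rows, 10 numeric features, one numeric target.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# Hold out 30% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Evaluating on the held-out rows tells us how well the model handles "new" data.
print(model.score(X_test, y_test))  # R-squared on unseen data
```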

In practice, these data sets are often created by cleaning and preparing a single data set that contains all of the variables needed for modelling, as well as a column or columns containing the results we are trying to predict.  Then, this data is split into two data sets.  This split is generally random, with a larger portion of the data going to the training set than to the testing set.  However, this methodology has a major flaw: how do we know if we got a bad sample?  What if, by random chance, our training data was missing a significant pattern that existed in our testing set, or vice versa?  This could cause us to inappropriately label our model as "good" due to something that is entirely outside of our control.  This is where Cross-Validation comes into play.

Cross-Validation is a process for creating multiple pairs of training and testing sets from the same data.  Imagine that we split our data in half.  For the first model, we train using the first half of the data and test using the second half.  It turns out that we can repeat this process by swapping the training and testing sets.  So, we can train the second model using the second half of the data and test it using the first half.  Now, we have two models trained with different training sets and tested with different testing sets.  Since the two testing sets did not contain any of the same elements, we can combine the scored results to create a master set of scored data.  This master set will have a score for every record in our original data set, without a single model ever scoring a record that it was trained with.  This greatly reduces the chances of getting a bad sample because we are effectively scoring every element in our data, not just a small portion.

Let's expand this method a little.  First, we need to break our data into three sets.  The first model is trained using sets 1 and 2, and tested using set 3.  The second model is trained using sets 1 and 3, and tested using set 2.  The third model is trained using sets 2 and 3, and tested using set 1.  In practice, the sets are known as "folds". As you can see, we can extend this method out as far as we would like to create a master set of predictions.

K Fold Cross-Validation
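For reference, here's a hedged scikit-learn sketch of the same idea: three folds, three models, and a master set of out-of-fold predictions covering every record (the data and model are synthetic and purely illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Illustrative regression data: 300 rows, 8 numeric features.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Three folds: each model trains on two folds and scores the third,
# exactly as described above.
folds = KFold(n_splits=3, shuffle=True, random_state=0)

# cross_val_predict stitches the out-of-fold predictions back together, so every
# record is scored by a model that never saw it during training.
predictions = cross_val_predict(LinearRegression(), X, y, cv=folds)

print(predictions.shape)  # one prediction per record in the original data set
```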
There's even a special case of Cross-Validation known as "Leave One Out", where you train the model using all but one record, then score that individual record using the trained model.  This process is repeated for every record in the data set.
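In code terms, this corresponds to something like scikit-learn's LeaveOneOut, shown below on a deliberately tiny synthetic data set since this approach gets expensive very quickly (again, purely illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Keep the data small; Leave One Out trains one model per record.
X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=0)

# Each model is trained on 49 records and scores the single record left out.
predictions = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

print(predictions.shape)  # (50,) -> one out-of-fold prediction per record
```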

Now, the question becomes "How many folds should I have? 5? 10? 20? 1000?"  This is a question that some data scientists spend quite a bit of time wrestling with.  Fortunately for us, Azure ML defaults to 10 folds, so we don't have to worry too much about it.  Let's see it in action.

If we look back at our previous post, we can see that the normalization did not improve any of our models.  So, for simplicity, let's remove normalization.
Experiment So Far (No Normalization)
Now, let's take a look at the Cross Validate Model module.
Cross Validate Model
We see that this module takes an untrained model and a data set.  It also requires us to choose which column we would like to predict, "Price" in our case.  This module outputs a set of scored data, which looks identical to the scored data we get from the Score Data module, except that it contains all of the records, not just those in the testing set.  It also outputs a set of Evaluation Results by Fold.
Evaluation Results by Fold
This output shows a record for every fold in our Cross-Validation.  Since there were ten folds, we have ten rows, plus additional rows for the mean (average) and standard deviation of each column.  The columns in these results show some basic summary statistics for each fold.  This output is geared more towards experienced data scientists.  However, it can easily be used to recognize whether a particular fold is strikingly different from the others.  This could be helpful for determining whether there are subsets within our data that contain completely different patterns than the rest of the data.
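Outside of Azure ML, the closest analogue we're aware of is something like scikit-learn's cross_validate, which reports a metric for each fold that you can then summarize with a mean and standard deviation (a sketch on synthetic data, not the module itself):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Illustrative regression data.
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=1)

# Ten folds, mirroring Azure ML's default, with R-squared computed per fold.
results = cross_validate(LinearRegression(), X, y, cv=10, scoring="r2")

fold_scores = results["test_score"]
print(fold_scores)           # one R-squared value per fold
print(np.mean(fold_scores))  # mean across folds
print(np.std(fold_scores))   # standard deviation across folds
```

A fold whose score sits far from the mean is the kind of "strikingly different" fold described above.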

Next, we have to consider which model to input into the Cross-Validation.  The issue here is that the Cross Validate Model module requires an untrained model, while the Tune Model Hyperparameters module outputs a trained model.  So, in order to use our tuned models from the previous posts, we'll need to manually copy the tuned parameters from the Tune Model Hyperparameters module into the untrained model modules.
Online Gradient Descent Linear Regression Tuned Parameters
Online Gradient Descent Linear Regression
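As a rough code analogue of this "copy the tuned parameters into an untrained model" step, here's a hedged scikit-learn sketch: tune with a grid search, then reuse the winning parameters in a fresh, untrained model for cross-validation.  The model, grid and data below are illustrative assumptions, not the actual parameters from our experiment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Illustrative regression data.
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=2)

# Step 1: tune hyperparameters (analogous to Tune Model Hyperparameters).
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=2),
    param_grid={"n_estimators": [100, 200], "learning_rate": [0.05, 0.1]},
    cv=5,
)
grid.fit(X, y)

# Step 2: copy the tuned parameters into a fresh, untrained model
# (analogous to feeding an untrained model into Cross Validate Model).
tuned_model = GradientBoostingRegressor(random_state=2, **grid.best_params_)

# Step 3: cross-validate the untrained-but-tuned model.
scores = cross_val_score(tuned_model, X, y, cv=10, scoring="r2")
print(grid.best_params_)
print(scores.mean())
```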
Next, we need to determine which models we want to compare.  While it may be mildly interesting to compare the linear regression model trained with the 70/30 split to the linear regression model trained with Cross-Validation, that would be more of an example of the effect that Cross-Validation can have on the results.  This doesn't have much business value.  Instead, let's compare the results of the Cross-Validation models to see which model is best.  Remember, the entire reason that we used Cross-Validation was to minimize the impact that a bad sample could have on our models.  In the previous post, we found that the Poisson Regression model was the best fit.
Ordinary Least Squares Linear Regression vs Online Gradient Descent Linear Regression

Boosted Decision Tree Regression vs. Poisson Regression
We can see that Poisson Regression still has the highest Coefficient of Determination, meaning that we can once again conclude that it is the best model.  However, this leads us to another important question: why is it important to perform Cross-Validation if it didn't change the result?  The primary answer involves "trust".  Most of what we do in the data science world involves math that many business people would see as voodoo.  Therefore, one of the easiest paths to success is to gain trust, which comes from a preponderance of evidence.  The more evidence we can provide showing that our algorithm is reliable and accurate, the easier it will be for us to convince other people to use it.
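If you'd like to replicate this kind of comparison in code, here's a hedged sketch that cross-validates several regressors on the same folds and compares their mean Coefficient of Determination (R-squared).  The models and data are illustrative stand-ins; Azure ML's algorithms don't all have exact scikit-learn equivalents.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, PoissonRegressor, SGDRegressor
from sklearn.model_selection import KFold, cross_val_score

# Illustrative regression data with a non-negative target (required by Poisson).
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=3)
y = y - y.min() + 1

# Use the same ten folds for every model so the comparison is apples-to-apples.
folds = KFold(n_splits=10, shuffle=True, random_state=3)

models = {
    "Ordinary Least Squares": LinearRegression(),
    "Gradient Descent Linear": SGDRegressor(max_iter=2000, random_state=3),
    "Boosted Decision Tree": GradientBoostingRegressor(random_state=3),
    "Poisson Regression": PoissonRegressor(max_iter=1000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=folds, scoring="r2")
    print(f"{name}: mean R-squared = {scores.mean():.3f}")
```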

Hopefully, this series opened your mind to the possibilities of Regression in Azure Machine Learning.  Stay tuned for more posts.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com