Today, we're going to walk through Sample 3: Cross Validation for Binary Classification Adult Dataset. So far, the Azure ML samples have been interesting combinations of tools meant for learning the basics. Now, we're finally going to get some actual Data Science! To start, here's what the experiment looks like.
|
Sample 3: Model Building and Cross-Validation |
We had originally intended to skim over all of these models in a single post. However, it was so interesting that we decided to break them out into separate posts. So, we'll only be looking at the initial data import and the model on the far left, Two-Class Averaged Perceptron. The first tool in this experiment is the Saved Dataset.
|
Adult Census Income Binary Classification Dataset |
As usual, the first step of any analysis is to bring in some data. In this case, they chose to use a Saved Dataset called "Adult Census Income Binary Classification dataset". This is one of the many sample datasets that's available in Azure ML Studio. You can find it by navigating to Saved Datasets on the left side of the window.
|
Sample Dataset |
Let's take a peek at this data to see what we're dealing with.
|
Adult Census Income Binary Classification Dataset (Visualize) |
As you can see, this dataset has about 32k rows and 15 columns. These columns appear to be descriptions of people. We have some of the common demographic data, such as age, education, and occupation, with the addition of an income field at the end.
|
Adult Census Income Binary Classification Dataset (Visualize) (Income) |
This field takes two values, "<=50k" and ">50k", with the majority of people being in the lower bucket. This would be a great dataset to do some predictive analytics on! Let's move on to the next tool, Partition and Sample.
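To see the split for ourselves outside of Azure ML, we could count the two income buckets with pandas. This is just a sketch on a tiny hypothetical stand-in for the dataset; the column name and values mirror what the Visualize pane shows:

```python
import pandas as pd

# Hypothetical in-memory stand-in for the Adult Census dataset;
# the real data has roughly 32k rows and 15 columns.
df = pd.DataFrame({"income": ["<=50k", "<=50k", ">50k", "<=50k", ">50k", "<=50k"]})

# Count how many people fall into each income bucket.
counts = df["income"].value_counts()
print(counts)
```

As in the real dataset, the "<=50k" bucket dominates.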
|
Partition and Sample |
This is a pretty cool tool that allows you to trim down your rows in a few different ways. We could easily spend an entire post talking about this tool, so we'll keep it brief. You have four different options for "Partition or Sample Mode". In this sample, they have selected "Sampling" with a rate of .2 (or 20%). This allows us to take a random sample of our data. We also have the "Head" option, which passes through only the top N rows; this is really useful if we are debugging a large experiment and don't want to wait for the sampling algorithm to run. We also have the option to sample Folds, which is another name for a partition or subset of data. Let's take a quick look at the visualization to see if anything else is going on.
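Outside of Azure ML, the "Sampling" and "Head" modes behave much like pandas' `sample` and `head`. A minimal sketch on hypothetical data (the row counts and seed are ours, not the sample's):

```python
import pandas as pd

# 100 hypothetical rows standing in for the census data.
df = pd.DataFrame({"age": range(100), "income": ["<=50k"] * 80 + [">50k"] * 20})

# "Sampling" mode with a rate of .2: keep a random 20% of the rows.
sampled = df.sample(frac=0.2, random_state=42)

# "Head" mode: pass through only the top N rows -- handy for debugging.
head = df.head(10)

print(len(sampled), len(head))
```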
|
Partition and Sample (Visualize) |
We have the same 15 columns as before; the only difference is that we now have only 20% of the rows. Nothing else seems to be happening here. Let's move on.
|
Clean Missing Data |
In a previous post, we spoke about the Clean Missing Data tool in more detail. To briefly summarize, you can tell Azure ML how to handle missing (or null) values. In this case, we are telling the algorithm to replace missing values in all columns with 0. Let's move on to the star of the show, Model Building. We'll start with the model on the far left, Two-Class Averaged Perceptron.
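For readers who prefer code, replacing missing values in all columns with 0 is the same operation as pandas' `fillna`. A minimal sketch on hypothetical rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a couple of missing values.
df = pd.DataFrame({
    "age": [39, np.nan, 50],
    "hours_per_week": [40, 45, np.nan],
})

# Replace missing values in all columns with 0,
# mirroring the Clean Missing Data settings in this sample.
cleaned = df.fillna(0)
print(cleaned)
```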
|
Two-Class Averaged Perceptron |
This example is really interesting to us because we had never heard of it before writing this post. The Two-Class Averaged Perceptron algorithm is actually quite simple. It takes a large number of numeric variables (if you give it categorical data, it will automatically translate it into numeric variables, known as Dummy Variables). Then, it multiplies the input variables by weights and adds them together to produce a numeric output. That output is a score that can be used to choose between two different classes. In our case, these classes are "Income <= 50k" and "Income > 50k". Some of you might think this logic sounds very similar to a Neural Network. In fact, the Two-Class Averaged Perceptron algorithm is a simple implementation of a Neural Network.
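The scoring step described above can be sketched in a few lines of numpy and pandas. The rows, weights, and bias below are invented for illustration; they are not what Azure ML would actually learn:

```python
import numpy as np
import pandas as pd

# A couple of hypothetical rows with one categorical column.
rows = pd.DataFrame({"age": [39.0, 50.0], "education": ["Bachelors", "Masters"]})

# Categorical data is translated into numeric dummy variables.
X = pd.get_dummies(rows, columns=["education"]).astype(float)

# Illustrative weights (one per input column) plus a bias term.
w = np.array([0.01, -0.3, 0.5])
b = -0.2

# Multiply the inputs by the weights and add them together to get a score,
# then threshold the score to choose between the two classes.
scores = X.to_numpy() @ w + b
labels = np.where(scores > 0, ">50k", "<=50k")
print(scores, labels)
```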
This algorithm gives us three main parameters: "Create Trainer Mode", "Learning Rate" and "Maximum Number of Iterations". "Learning Rate" determines the size of each step the algorithm takes while searching for the "best" set of weights. If the "Learning Rate" is too high (making the number of steps too low), the model will train very quickly, but the weights may not be a very good fit. If the "Learning Rate" is too low (making the number of steps too high), the model will train very slowly, but could possibly produce "better" weights. There are also concerns of Overfitting and Local Extrema to contend with.
"Maximum Number of Iterations" determines how many times the model is trained. Since this is an Averaged Perceptron algorithm, you can run the algorithm more than once. This allows the algorithm to develop a number of different sets of weights (10 in our case). These sets of weights are averaged together to get a final set of weights, which can then be used to classify new values. In practice, we could achieve the same result by creating 10 scores using the 10 sets of weights, then averaging the scores. However, that method would be far less efficient.
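For a linear scorer like this, averaging the weight vectors and then scoring gives exactly the same result as scoring with each weight vector and averaging the scores, which is why the averaged form is the efficient one. A small numpy demonstration with made-up weights:

```python
import numpy as np

# Three hypothetical weight vectors from three training passes.
W = np.array([[0.2, -0.1],
              [0.4,  0.3],
              [0.0,  0.1]])
x = np.array([1.0, 2.0])

# Averaged Perceptron: average the weights once, then score.
score_avg_weights = np.mean(W, axis=0) @ x

# Equivalent but less efficient: score with each weight vector, then average.
score_avg_scores = np.mean(W @ x)

print(score_avg_weights, score_avg_scores)
```

Averaging once up front means new rows are scored with a single dot product instead of one per training pass.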
Finally, we have the "Create Trainer Mode" parameter. This parameter allows us to pass in a single set of parameters (which is what we are currently doing) or multiple sets of parameters. You can find more information about this algorithm here and here.
This leaves us with a few questions that perhaps some readers could help us out with. If you have 10 iterations but set a specific random seed, does it create the same model 10 times, then average 10 identical weight vectors to get a single weight vector? Or does it use the random seed to create 10 new random seeds, which are then used to create 10 different weight vectors? What happens if you define a set of 3 Learning Rates and 10 Iterations? Will the algorithm run 30 iterations, or will it break the iterations into sets of 3, 3, and 4 to accommodate each of the learning rates? If you know the answers to these questions, please let us know in the comments. Out of curiosity, let's see what's under Visualize for this tool.
|
Two-Class Averaged Perceptron (Visualize) |
This is interesting. These aren't the same parameters that we input into the tool, nor do they seem to be affected by our data stream. This is a great opportunity to point out the way that models are built in Azure ML. Let's take a look at the data flow.
|
Data Flow |
We've rearranged the tools slightly to make the flow more obvious. As you can see, the data does not flow into the model directly. Instead, the model metadata is built using the model tool (Two-Class Averaged Perceptron in this case). Then, the model metadata and the sample data are consumed by whatever tool we want to use downstream (Cross Validate Model in this case). This means that we can reuse a model multiple times just by attaching it to different branches of a data stream. This is especially useful when we want to use the same model against different data sets. Let's move on to the final tool in this experiment, Cross Validate Model.
|
Cross Validate Model |
Cross-Validation is a technique for testing, or "validating", a model. Most people would test a model by using a Testing/Training split. This means that we split our data into two separate sets, one for training the model and another for testing it. This methodology is great because it allows us to test our model using data that it has never seen before. However, this method is very susceptible to bias if we don't sample our data properly, and it can also suffer from sample size issues. This is where Cross-Validation comes in.
Imagine that we used the Testing/Training method to create a model using the Training data, then tested the model using the Testing data. We could estimate how accurate our model is by seeing how well it predicts known values from our Testing data. But how do we know that we didn't get lucky? How do we know there isn't some strange relationship in our data that caused our model to predict our Testing data well, but predict real-world data poorly? To guard against this, we would want to train the model multiple times using multiple sets of data. So, we separate our data into 10 sets, or "folds". Nine of the folds are used to train the model, and the remaining fold is used to test it. We then repeat this process nine more times, changing which fold we use for testing. This means that we have created 10 separate models using 10 different training sets, and used them to predict 10 mutually exclusive testing sets. This is Cross-Validation. You can find out more about Cross-Validation here. Let's see it in action.
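The fold-splitting procedure described above can be sketched without Azure ML at all. Here we split 20 hypothetical row indices into 10 mutually exclusive folds and build the 10 train/test splits (the row count and seed are ours):

```python
import numpy as np

# Shuffle 20 hypothetical row indices and split them into 10 folds.
rng = np.random.default_rng(0)
indices = rng.permutation(20)
folds = np.array_split(indices, 10)

splits = []
for k in range(10):
    test_idx = folds[k]                   # one fold held out for testing
    train_idx = np.concatenate([folds[j] for j in range(10) if j != k])
    splits.append((train_idx, test_idx))  # train on the other nine folds

print(len(splits))
```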
|
Scored Results (1) |
There are two outputs from the Cross Validate Model tool, Scored Results and Evaluation Results by Fold. The Scored Results output shows us the same data that we passed into the tool, with 3 additional columns. The first column, Fold Assignments, is added to the start of the data. This tells us which of the 10 sets, or "folds", the row was sampled into.
|
Scored Results (2) |
The remaining columns, Scored Labels and Scored Probabilities, are added to the end of the data. The Scored Labels column tells us which category the model predicted this row would fall into. This is what we were looking for all along. The Scored Probability is a bit more complicated. Mathematically, the algorithm wasn't trying to predict whether Income was "<=50k" or ">50k". It was only trying to predict ">50k", because in a Two-Class algorithm, if you aren't ">50k", then you must be "<=50k". If you look down the Scored Probabilities column, you will see that all Scored Probabilities less than .5 have a Scored Label of "<=50k" and all Scored Probabilities greater than .5 have a Scored Label of ">50k". If we were using a Multi-Class algorithm, it would be far more complicated. If you want to learn more about the Two-Class Averaged Perceptron algorithm, read here and here.
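The relationship between Scored Probabilities and Scored Labels amounts to a simple threshold at .5. A sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical scored probabilities of the positive class ">50k".
probs = np.array([0.02, 0.47, 0.51, 0.93])

# The Scored Label is just the Scored Probability thresholded at .5.
labels = np.where(probs >= 0.5, ">50k", "<=50k")
print(labels.tolist())
```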
There is one neat thing we wanted to show using this visualization though.
|
Scored Results (Comparison) |
When we click on the "Income" column, a histogram will pop up on the right side of the window. If we click on the "Compare To" drop-down, and select "Scored Labels", we get a very interesting chart.
|
Contingency Table |
This is called a Contingency Table, also known as a Confusion Matrix or a Crosstab. It shows you the distribution of your correct and incorrect predictions. As you can see, our model is very good at predicting when a person has an Income of "<=50k", but not very good at predicting ">50k". We could go much deeper into the concept of model validation, but this was an interesting chart that we stumbled upon here. Let's look at the Evaluation Results by Fold.
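The same kind of contingency table can be built from any pair of actual and predicted columns with pandas' `crosstab`. The values below are invented to mimic the pattern we saw (good at "<=50k", weaker at ">50k"):

```python
import pandas as pd

# Hypothetical actual incomes and model predictions.
actual = pd.Series(["<=50k", "<=50k", "<=50k", ">50k", ">50k", ">50k"],
                   name="Income")
predicted = pd.Series(["<=50k", "<=50k", "<=50k", "<=50k", "<=50k", ">50k"],
                      name="Scored Labels")

# A contingency table (confusion matrix) of actual vs. predicted classes.
table = pd.crosstab(actual, predicted)
print(table)
```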
|
Evaluation Results by Fold |
This shows you a bunch of statistics about your Cross-Validation. The purpose of this post was to talk about the Two-Class Averaged Perceptron, so we won't spend much time here. However, don't be surprised if we make a full-length post about this in the future because there is a lot of information here.
We hope that this post sparked as much excitement in you as it did in us. We're really starting to see just how much awesomeness is packed into Azure Machine Learning Studio, and we're so excited to keep digging. Thanks for reading. We hope you found this informative.
Brad Llewellyn
BI Engineer
Valorem Consulting
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com