Monday, December 12, 2016

Azure Machine Learning: Classification Using Two-Class Logistic Regression

Today, we're going to continue looking at Sample 3: Cross Validation for Binary Classification Adult Dataset in Azure Machine Learning.  In the two previous posts, we looked at the Two-Class Averaged Perceptron and Two-Class Boosted Decision Tree algorithms.  The next algorithm in the experiment is Two-Class Logistic Regression.  Let's start by refreshing our memory on the data set.
Adult Census Income Binary Classification Dataset (Visualize)

Adult Census Income Binary Classification Dataset (Visualize) (Income)
This dataset contains the demographic information about a group of individuals.  We see the standard information such as Race, Education, Marital Status, etc.  Also, we see an "Income" variable at the end.  This variable takes two values, "<=50k" and ">50k", with the majority of the observations falling into the smaller bucket.  The goal of this experiment is to predict "Income" by using the other variables.  Let's take a look at the Two-Class Logistic Regression Tool.
Two-Class Logistic Regression
Logistic Regression is one of the more "mathematically pure" methods for Two-Class Prediction.  We'd imagine that virtually all statistics majors learn about this procedure in school.  Logistic Regression is a cousin of Linear Regression.  The main difference is that Linear Regression applies a linear function (ax + by + c) to predict a continuous value, while Logistic Regression passes that same kind of linear function through a logit-based transformation to predict the probability of a binary value.  You can read more about the Logit here if you are interested.  Let's move on to the parameters.
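To make the contrast with Linear Regression a little more concrete, here is a minimal Python sketch.  The coefficients and inputs are made-up values chosen only for illustration; nothing here comes from the Azure ML experiment itself.

```python
import math

def linear_predict(a, b, c, x, y):
    """Linear Regression: the prediction is the linear function itself."""
    return a * x + b * y + c

def logistic_predict(a, b, c, x, y):
    """Logistic Regression: squeeze the same linear function into (0, 1)."""
    z = a * x + b * y + c
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """The logit (log-odds) undoes the logistic function."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and inputs, for illustration only.
a, b, c = 0.8, -0.5, 0.1
score = linear_predict(a, b, c, x=2.0, y=1.0)    # a continuous value
prob = logistic_predict(a, b, c, x=2.0, y=1.0)   # a probability between 0 and 1
label = ">50k" if prob > 0.5 else "<=50k"        # the two-class prediction
```

Applying logit to the probability gets us back to the linear score, which is why the two methods are such close cousins.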

As with many advanced machine learning algorithms, Two-Class Logistic Regression runs through the algorithm multiple times to ensure that we get the best predictions possible.  This means that the algorithm needs to know when to stop.  Most algorithms will stop whenever the new model doesn't significantly deviate from the old model.  This is called "Convergence".  The "Optimization Tolerance" parameter tells the algorithm how close the models have to be in order for it to stop.
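Here is a toy Python sketch of what a tolerance-based stopping rule looks like.  It minimizes a simple made-up function rather than a real logistic regression loss, and the exact convergence test Azure ML uses is not something we know, so treat the comparison below as an assumption.

```python
def fit_until_converged(tolerance, step=0.1, max_iterations=10_000):
    """Minimize f(w) = (w - 3)^2 by gradient descent; stop when updates get tiny."""
    w = 0.0
    for iteration in range(1, max_iterations + 1):
        gradient = 2.0 * (w - 3.0)
        new_w = w - step * gradient
        if abs(new_w - w) < tolerance:   # the new model barely differs from the old one
            return new_w, iteration
        w = new_w
    return w, max_iterations

loose_w, loose_steps = fit_until_converged(tolerance=1e-2)
tight_w, tight_steps = fit_until_converged(tolerance=1e-7)
```

A tighter tolerance takes more iterations, but lands closer to the true optimum of 3.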

The "L1 Regularization Weight" and "L2 Regularization Weight" parameters are used to prevent overfitting, which we talked about in-depth in the previous post.  They do this by penalizing models that contain extreme coefficients.

The "L1 Regularization Weight" parameter is useful for "sparse" datasets.  A dataset is considered sparse when every combination of variables is either poorly represented or not represented at all.  This is extremely common when dealing with data sets with a small number of observations and/or a large number of variables.

The "L2 Regularization Weight" parameter is useful for "dense" datasets.  A dataset is considered dense when every combination of variables is well represented.  This is common when dealing with data sets with a large number of observations and/or a small number of variables.  You can also think of "dense" as the opposite of "sparse".
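As a rough Python sketch of how these two penalties punish extreme coefficients, consider the following.  The coefficient values are invented, and the exact form of the penalty inside Azure ML (for example, any scaling constants) is an assumption on our part.

```python
def l1_penalty(weights, l1_weight):
    """L1 penalty: proportional to the sum of absolute coefficient values."""
    return l1_weight * sum(abs(w) for w in weights)

def l2_penalty(weights, l2_weight):
    """L2 penalty: proportional to the sum of squared coefficient values."""
    return l2_weight * sum(w * w for w in weights)

# Hypothetical coefficient sets, for illustration only.
modest = [0.5, -0.3, 0.2]
extreme = [5.0, -3.0, 2.0]

modest_l1, extreme_l1 = l1_penalty(modest, 1.0), l1_penalty(extreme, 1.0)
modest_l2, extreme_l2 = l2_penalty(modest, 1.0), l2_penalty(extreme, 1.0)
```

Scaling the coefficients up by 10 multiplies the L1 penalty by 10 and the L2 penalty by 100, so the L2 penalty comes down much harder on very large coefficients.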

The "Memory Size for L-BFGS" parameter determines how much history to store on previous iterations.  The smaller you set this number, the less history you will have.  This will lead to faster, less memory-hungry computation, but potentially weaker predictions.

Finally, the "Random Number Seed" parameter is useful if you want reproducible results.  If you want to learn more about the Two-Class Logistic Regression procedure or any of these parameters, read here and here.

We're not sure what you think, but we have no idea what to enter for most of these parameters.  Good thing Azure ML has an algorithm that can optimize these parameters for us.  Let's take a look at the "Tune Model Hyperparameters" tool.
Condensed Experiment

Tune Model Hyperparameters
The Tune Model Hyperparameters tool takes three inputs, an Untrained Model (Two-Class Logistic Regression in our case), a Training Dataset, and an optional Validation (or Testing) Dataset (which we don't have).  Ironically, this tool is designed to help us choose parameters, yet has some serious parameters of its own.  For now, we'll stick with the defaults and leave the discussion of this tool for a later post.  The one parameter we do need to set is the "Selected Columns" parameter so the algorithm knows which value we are trying to predict.  This tool has two outputs, "Sweep Results" and "Trained Best Model".  The "Sweep Results" output shows all of the different test runs, as well as their resulting metrics.  Let's take a look at the "Trained Best Model" visualization.
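Conceptually, a parameter sweep like this can be sketched in a few lines of Python.  The evaluate function below is a made-up stand-in for "train the model and measure it", and the candidate values are arbitrary; we don't know how the Azure ML tool actually picks its candidates.

```python
import itertools

def evaluate(params):
    """Hypothetical stand-in for 'train a model with these parameters and score it'."""
    # A made-up score surface that peaks at l1_weight=0.1, l2_weight=1.0.
    return 1.0 - abs(params["l1_weight"] - 0.1) - 0.1 * abs(params["l2_weight"] - 1.0)

# Arbitrary candidate values, for illustration only.
grid = {
    "l1_weight": [0.0, 0.1, 1.0],
    "l2_weight": [0.01, 1.0, 10.0],
}

# "Sweep Results": one (score, parameters) entry per combination tried.
sweep_results = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    sweep_results.append((evaluate(params), params))

# The winning parameter set, analogous to what "Trained Best Model" reports.
best_score, best_params = max(sweep_results, key=lambda item: item[0])
```

Trying every combination gets expensive quickly as parameters are added, which is presumably why the real tool has so many options of its own.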
Tune Model Hyperparameters (Trained Best Model)
As you can see, this visualization tells us what the best set of parameters was.  However, this tool technically outputs a Trained Model.  The Cross-Validate Model tool that we are using at the end of our experiment requires an Untrained Model.  So, to avoid having to introduce another new tool in this post, we'll just write these parameters into the Two-Class Logistic Regression tool.
Two-Class Logistic Regression (Tuned Hyperparameters)
Now, let's take a look at the Contingency Table from the Cross-Validate Model tool, and compare it to the tables from the other posts.
Contingency Table (Two-Class Averaged Perceptron)

Contingency Table (Two-Class Boosted Decision Tree)
Contingency Table (Two-Class Logistic Regression)
As you can see, all three of the algorithms are close when it comes to correctly predicting "<=50k".  The story changes when it comes to ">50k".  While the Two-Class Logistic Regression algorithm is better than the Two-Class Averaged Perceptron algorithm, it isn't quite as good as the Two-Class Boosted Decision Tree.  We should point out that comparing these contingency tables is only a small part of choosing the "best" model.  We can go more in-depth on model selection in a later post.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
BI Engineer
Valorem Consulting
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, November 21, 2016

Azure Machine Learning: Classification Using Two-Class Boosted Decision Tree

Today, we're going to continue our walkthrough of Sample 3: Cross Validation for Binary Classification Adult Dataset.  In the previous post, we walked through the initial data load, as well as the Two-Class Averaged Perceptron algorithm.  Now, we're going to walk through the next algorithm, Two-Class Boosted Decision Tree.  Let's start with a simple overview of the experiment.
Sample 3: Cross Validation for Binary Classification Adult Dataset
The purpose of this experiment is to take a dataset of Demographic data about individuals, and attempt to predict their income based on these factors.  Here's a snippet of that dataset.
Adult Census Income Binary Classification Dataset (Visualize)
Adult Census Income Binary Classification Dataset (Visualize) (Income)
If you want to learn more about the data import section of this experiment, check out the previous post.  Let's move on to the star of the show, Two-Class Boosted Decision Tree.  This is one of our favorite algorithms because it is incredibly simple to visualize, yet offers extremely powerful predictions.
Two-Class Boosted Decision Tree
This algorithm doesn't just construct one tree; it constructs as many as you want (100 in this case).  What's extremely interesting about these additional trees is that they are not independent of their predecessors.  According to MSDN, "the second tree corrects for the errors of the first tree, the third tree corrects for the errors of the first and second trees, and so forth."  This means that our trees should get better as we increase our "Number of Trees Constructed" parameter.  Unfortunately, this would mean that trees later in the process have a much higher risk of "Overfitting" than trees earlier in the process.  "Overfitting" is a situation where the model has been trained so heavily that it predicts your training data extremely accurately, but is very poor at predicting new observations.  Fortunately, the algorithm accounts for this by not just taking the prediction from the final tree in the set.  It combines the predictions from every tree in the ensemble.  This greatly lessens the effect of "Overfitting" while still providing accurate predictions.
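To illustrate the idea of each tree correcting the errors of its predecessors, here is a heavily simplified Python sketch.  It is an assumption-laden toy: it uses one-split trees ("stumps"), a single numeric input, and squared error on 0/1 labels, none of which is claimed to match the Azure ML implementation.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the current residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees, learning_rate=0.5):
    """Each new stump is fit to whatever the earlier stumps still get wrong."""
    trees, errors = [], []
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # The ensemble prediction adds a scaled contribution from every tree so far.
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return trees, errors

# Made-up toy data: one numeric input and a 0/1 outcome.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
trees, errors = boost(xs, ys, n_trees=10)
```

The errors list records the training error after each tree is added.  It keeps shrinking on the training data, which is exactly why overfitting becomes a concern as the number of trees grows.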

The "Maximum Number of Leaves per Tree" parameter allows us to set the number of times the tree can split.  It's important to note that splits early in the tree are caused by the most significant predictors, while splits later in the tree are less significant.  This means that the more leaves you have (and therefore more splits), the higher your chance of overfitting is.  This is why Validation is so important.

The "Minimum Number of Samples per Leaf Node" parameter allows us to set the significance level required for a split to occur.  With this value set at 10, the algorithm will only choose to split (this is known as creating a "new rule") if at least 10 rows, or observations, will be affected.  Increasing this value will lead to broad, stable predictions, while decreasing this value will lead to narrow, precise predictions.

The "Learning Rate" parameter allows us to set how much difference we see from tree to tree.  MSDN describes this quite well as "the learning rate determines how fast or slow the learner converges on the optimal solution. If the step size is too big, you might overshoot the optimal solution. If the step size is too small, training takes longer to converge on the best solution."
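The overshooting behavior MSDN describes is easy to see in a toy Python example.  This sketch uses plain gradient descent on a made-up function, not a boosted tree, so it only illustrates the general effect of step size.

```python
def descend(step_size, iterations=20, start=10.0):
    """Run gradient descent on f(w) = w^2, whose optimal solution is w = 0."""
    w = start
    for _ in range(iterations):
        w = w - step_size * 2.0 * w   # the gradient of w^2 is 2w
    return w

too_small = descend(step_size=0.01)   # creeps toward 0 and is still far away
reasonable = descend(step_size=0.3)   # ends up very close to 0
too_big = descend(step_size=1.1)      # overshoots on every step and moves away
```

All three runs take the same 20 steps from the same starting point; only the step size differs.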

Finally, this algorithm lets us select a "Create Trainer Mode".  This is extremely useful if we can't decide exactly what parameters we want.  We'll talk more about parameter selection in a later post.  If you want to learn more about this algorithm, read here and here.  Let's visualize this tool.
Two-Class Boosted Decision Tree (Visualize)
Just like with the Two-Class Averaged Perceptron algorithm, the visualization of the untrained model is not very informative.  Strangely enough, this visualization shows us the correct parameters, whereas the Two-Class Averaged Perceptron did not.  What would be far more interesting is if we could look at the trained tree.  In order to do this, we need to add a new tool to our experiment, Train Model.
Condensed Experiment
Train Model
The Train Model initialization is pretty simple.  All we need to do is select our variable of interest, which is "Income" in this case.  Let's take a look at the visualization.
Train Model (Visualization)
As you can see, this visualization lets you look through all the trees created in the training process.  Let's zoom in on a particular section of the tree.
Train Model (Visualization) (Zoom)
EDIT: At the time of writing this, there is a bug related to the display of predictions within Decision Trees.  Please see here for more details.

As you can see, each split in the tree relies on a single variable in a single expression, known as a predicate.  The first predicate says

marital-status.Married-civ-spouse <= 0.5

We've talked before about the concept of Dummy Variables.  When you pass a categorical variable to a numeric algorithm like this, it has to translate the values to numeric.  It does this by creating Dummy, or Indicator, Variables.  In this case, it created Dummy Variables for the "marital-status" variable.  One of these variables is "marital-status.Married-civ-spouse".  This variable takes a value of 1 if the observation has "marital-status = Married-civ-spouse" and 0 otherwise.  Therefore, this predicate is really just a numeric way of asking "Does this person have a Marital Status of Married-civ-spouse?"  We're not sure exactly what this value means because this isn't our data set, but it's the most common value in the dataset.  Therefore, it probably means being married and living together.
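Here is a small Python sketch of how a categorical column can be turned into Dummy Variables and then tested with the predicate above.  The four sample rows are made up, and the column naming simply mimics what we see in the visualization.

```python
def make_dummies(values):
    """Turn one categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {
        f"marital-status.{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# Made-up sample rows, for illustration only.
marital_status = ["Married-civ-spouse", "Never-married", "Divorced", "Married-civ-spouse"]
dummies = make_dummies(marital_status)

# The predicate "marital-status.Married-civ-spouse <= 0.5" is true exactly
# when the indicator is 0, i.e. when the person is NOT Married-civ-spouse.
column = dummies["marital-status.Married-civ-spouse"]
predicate_is_true = [value <= 0.5 for value in column]
```

Since the indicator can only be 0 or 1, any threshold strictly between the two would split the rows the same way; 0.5 is just the natural midpoint.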

Under the predicate definition, we also see a value for "Split Gain".  This is a measure of how significant the split was.  A large value means a more significant split.  Since Google is our best friend, we found a very informative answer on StackOverflow explaining this.  You can read it here.

What we find very interesting about this tree structure is that it is not "balanced".  This means that in some cases, we can reach a prediction very quickly or very slowly depending on which side of the tree we are on.  We can see one prediction in Level 2 (the root level is technically considered Level 0, so the 3rd level is considered Level 2).  We're not sure what causes the tree to choose whether to predict or split.  The MSDN article seems to imply that it's based on the combination of the "Minimum Number of Samples per Leaf Node" as well as some internal Information (or Split Gain) threshold.  Perhaps one of our readers can enlighten us about this.

Since we talked heavily about Cross-Validation in the previous post, we won't go into too much detail here.  However, it may be interesting to see the Contingency table to determine how well this model predicted our data.
Contingency Table (Two-Class Boosted Decision Tree)
Contingency Table (Two-Class Averaged Perceptron)
As you can see, the number of correct predictions for "Income <= 50k" is about the same between the two algorithms, but the Two-Class Boosted Decision Tree wins when it comes to the "Income > 50k" category.  We would need more analysis to make a sound decision, but we'll have to save that for a later post.  Thanks for reading.  We hope you found this informative.

Monday, October 31, 2016

Azure Machine Learning: Classification Using Two-Class Averaged Perceptron

Today, we're going to walk through Sample 3: Cross Validation for Binary Classification Adult Dataset.  So far, the Azure ML samples have been interesting combinations of tools meant for learning the basics.  Now, we're finally going to get some actual Data Science!  To start, here's what the experiment looks like.
Sample 3: Model Building and Cross-Validation
We had originally intended to skim over all of these models in a single post.  However, it was so interesting that we decided to break them out into separate posts.  So, we'll only be looking at the initial data import and the model on the far left, Two-Class Averaged Perceptron.  The first tool in this experiment is the Saved Dataset.
Adult Census Income Binary Classification Dataset
As usual, the first step of any analysis is to bring in some data.  In this case, they chose to use a Saved Dataset called "Adult Census Income Binary Classification dataset".  This is one of the many sample datasets that's available in Azure ML Studio.  You can find it by navigating to Saved Datasets on the left side of the window.
Sample Dataset
Let's take a peek at this data to see what we're dealing with.
Adult Census Income Binary Classification Dataset (Visualize)
As you can see, this dataset has about 32k rows and 15 columns.  These columns appear to be descriptions of people.  We have some of the common demographic data, such as age, education, and occupation, with the addition of an income field at the end.
Adult Census Income Binary Classification Dataset (Visualize) (Income)
This field takes two values, "<=50k" and ">50k", with the majority of people being in the lower bucket.  This would be a great dataset to do some predictive analytics on!  Let's move on to the next tool, Partition and Sample.
Partition and Sample
This is a pretty cool tool that allows you to trim down your rows in a few different ways.  We could easily spend an entire post talking about this tool; so we'll keep it brief.  You have four different options for "Partition or Sample Mode".  In this sample, they have selected "Sampling" with a rate of .2 (or 20%).  This allows us to take random samples from our data.  We also have the "Head" option which allows us to pass through the top N rows, which would be really good if we were debugging a large experiment and didn't want to wait for the sampling algorithm to run.  We also have the option to sample Folds, which is another name for a partition or subset of data.  Let's take a quick look at the visualization to see if anything else is going on.
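As a rough Python sketch of the "Sampling" and "Head" options, consider the following.  The 100-row list is a stand-in for a real dataset, and whether Azure ML returns exactly 20% of the rows or only approximately 20% is an assumption here.

```python
import random

def sample_rows(rows, rate, seed=None):
    """Return a random subset containing roughly `rate` of the rows."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * rate))

def head(rows, n):
    """Pass through only the top n rows, with no randomness involved."""
    return rows[:n]

rows = list(range(1, 101))                       # stand-in for a 100-row dataset
sampled = sample_rows(rows, rate=0.2, seed=42)   # 20 randomly chosen rows
top_five = head(rows, 5)                         # always the first 5 rows
```

Fixing the seed makes the random sample repeatable, while "Head" is repeatable by construction.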
Partition and Sample (Visualize)
We have the same 15 columns as before; the only difference is that we only have 20% of the rows.  Nothing else seems to be happening here.  Let's move on.
Clean Missing Data
In a previous post, we've spoken about the Clean Missing Data tool in more detail.  To briefly summarize, you can tell Azure ML how to handle missing (or null) values.  In this case, we are telling the algorithm to replace missing values in all columns with 0.  Let's move on to the star of the show, Model Building.  We'll start with the model on the far left, Two-Class Averaged Perceptron.
Two-Class Averaged Perceptron
This example is really interesting to us because we've never heard of it before writing this.  The Two-Class Averaged Perceptron algorithm is actually quite simple.  It takes a large number of numeric variables (it will automatically translate Categorical data into Numeric if you give it any.  These new variables are called Dummy Variables.).  Then, it multiplies the input variables by weights and adds them together to produce a numeric output.  That output is a score that can be used to choose between two different classes.  In our case, these classes are "Income <= 50k" and "Income > 50k".  Some of you might think this logic sounds very similar to a Neural Network.  In fact, the Two-Class Averaged Perceptron algorithm is a simple implementation of a Neural Network.

This algorithm gives us the option of providing three main parameters, "Create Trainer Mode", "Learning Rate" and "Maximum Number of Iterations".  "Learning Rate" determines the size of the steps the algorithm takes while searching for the "best" set of weights.  If the "Learning Rate" is too high (making the steps large and the number of steps low), the model will train very quickly, but the weights may not be a very good fit.  If the "Learning Rate" is too low (making the steps small and the number of steps high), the model will train very slowly, but could possibly produce "better" weights.  There are also concerns of Overfitting and Local Extrema to contend with.

"Maximum Number of Iterations" determines how many times the model is trained.  Since this is an Averaged Perceptron algorithm, you can run the algorithm more than once.  This will allow the algorithm to develop a number of different sets of weights (10 in our case).  These sets of weights can be averaged together to get a final set of weights, which can then be used to classify new values.  In practice, we could achieve the same result by creating 10 scores using the 10 sets of weights, then averaging the scores.  However, that method would seem to be far less efficient.

Finally, we have the "Create Trainer Mode" parameter.  This parameter allows us to pass in a single set of parameters (which is what we are currently doing) or pass in multiple sets of parameters.  You can find more information about this algorithm here and here.
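The claim above that averaging the sets of weights gives the same result as averaging the scores follows from the score being a simple weighted sum.  Here is a minimal Python sketch with three made-up weight sets (instead of 10) and one made-up observation; it only demonstrates that equivalence, not how Azure ML actually produces its weight sets.

```python
def score(weights, inputs):
    """Multiply each input by its weight and add the results together."""
    return sum(w * x for w, x in zip(weights, inputs))

# Hypothetical weight sets from separate iterations, for illustration only.
weight_sets = [
    [0.2, -0.5, 1.0],
    [0.4, -0.3, 0.8],
    [0.3, -0.4, 1.2],
]
inputs = [1.0, 2.0, 3.0]   # one made-up observation

# Option 1: average the weights first, then score once.
averaged_weights = [sum(column) / len(column) for column in zip(*weight_sets)]
score_from_averaged_weights = score(averaged_weights, inputs)

# Option 2: score with every weight set, then average the scores.
average_of_scores = sum(score(w, inputs) for w in weight_sets) / len(weight_sets)
```

Both options produce the same number, but the first only needs one set of weights to be stored and one score to be computed per new observation.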

This leaves us with a few questions that perhaps some readers could help us out with.  If you have 10 iterations, but set a specific random seed, does it create the same model 10 times, then average 10 identical weight vectors to get a single weight vector?  Does it use the random seed to create 10 new random seeds, which are then used to create 10 different weight vectors?  What happens if you define a set of 3 Learning Rates and 10 Iterations?  Will the algorithm run 30 iterations or will it break the iterations into sets of 3, 3, and 4 to accommodate each of the learning rates?  If you know the answers to these questions, please let us know in the comments.  Out of curiosity, let's see what's under Visualize for this tool.
Two-Class Averaged Perceptron (Visualize)
This is interesting.  These aren't the same parameters that we input into the tool, nor do they seem to be affected by our data stream.  This is a great opportunity to point out the way that models are built in Azure ML.  Let's take a look at the data flow.
Data Flow
We've rearranged the tools slightly to make it more obvious.  As you can see, the data does not flow into the model directly.  Instead, the model metadata is built using the model tool (Two-Class Averaged Perceptron in this case).  Then, the model metadata and the sample data are consumed by whatever tool we want to use downstream (Cross Validate Model in this case).  This means that we can reuse a model multiple times just by attaching it to different branches of a data stream.  This is especially useful when we want to use the same model against different data sets.  Let's move on to the final tool in this experiment, Cross Validate Model.
Cross Validate Model
Cross-Validation is a technique for testing, or "validating", a model.  Most people would test a model by using a Testing/Training split.  This means that we split our data into two separate sets, one for training the model and another one for testing the model.  This methodology is great because it allows us to test our model using data that it has never seen before.  However, this method is very susceptible to bias if we don't sample our data properly, as well as sample size issues.  This is where Cross-Validation comes in.

Imagine that we used the Testing/Training method to create a model using the Training data, then tested the model using the Testing data.  We could estimate how accurate our model is by seeing how well it predicts known values from our Testing data.  But, how do we know that we didn't get lucky?  How do we know there isn't some strange relationship in our data that caused our model to predict our Testing data well, but predict real-world data poorly?  To do this, we would want to train the model multiple times using multiple sets of data.  So, we separate our data into 10 sets.  Nine of the sets are used to train the model, and the remaining set is used to test.  We could repeat this process nine more times by changing which of the sets we use to test.  This would mean that we have created 10 separate models using 10 different training sets, and used them to predict 10 mutually exclusive testing sets.  This is Cross-Validation.  You can find out more about Cross-Validation here.  Let's see it in action.
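The fold logic described above can be sketched in a few lines of Python.  The 50-row list is a stand-in for a real dataset, and the way rows are dealt into folds here (round-robin, with no shuffling) is a simplifying assumption rather than what Azure ML does.

```python
def k_fold_splits(rows, n_folds):
    """Yield (training_rows, testing_rows) once per fold."""
    folds = [rows[i::n_folds] for i in range(n_folds)]   # deal rows into folds
    for test_index in range(n_folds):
        testing = folds[test_index]
        training = [row
                    for i, fold in enumerate(folds) if i != test_index
                    for row in fold]
        yield training, testing

rows = list(range(50))                       # stand-in for a 50-row dataset
splits = list(k_fold_splits(rows, n_folds=10))
```

Each of the 10 splits trains on nine folds and tests on the remaining one, and every row is used for testing exactly once.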
Scored Results (1)
There are two outputs from the Cross Validate Model tool, Scored Results and Evaluation Results by Fold.  The Scored Results output shows us the same data that we passed into the tool, with 3 additional columns.  The first column, Fold Assignments, is added to the start of the data.  This tells us which of the 10 sets, or "folds", the row was sampled into.
Scored Results (2)
The remaining columns, Scored Labels and Scored Probabilities, are added to the end of the data.  The Scored Labels column tells us which category the model predicted this row would fall into.  This is what we were looking for all along.  The Scored Probabilities column is a bit more complicated.  Mathematically, the algorithm wasn't trying to predict whether Income was "<=50k" or ">50k".  It was only trying to predict ">50k" because in a Two-Class algorithm, if you aren't ">50k", then you must be "<=50k".  If you looked down the Scored Probabilities column, you would see that all Scored Probabilities less than .5 have a Scored Label of "<=50k" and all Scored Probabilities greater than .5 have a Scored Label of ">50k".  If we were using a Multi-Class algorithm, it would be far more complicated.  If you want to learn about the Two-Class Averaged Perceptron algorithm, read here and here.
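In Python, the relationship between the two columns looks something like this.  The probabilities are made-up values, and what happens to a probability of exactly .5 is an assumption, since we haven't seen one in the data.

```python
def scored_label(probability, threshold=0.5):
    """Translate the probability of '>50k' into one of the two labels."""
    # Assumption: a probability of exactly 0.5 falls into '<=50k'.
    return ">50k" if probability > threshold else "<=50k"

# Made-up Scored Probabilities, for illustration only.
scored_probabilities = [0.03, 0.48, 0.51, 0.97]
scored_labels = [scored_label(p) for p in scored_probabilities]
```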

There is one neat thing we wanted to show using this visualization though.
Scored Results (Comparison)
When we click on the "Income" column, a histogram will pop up on the right side of the window.  If we click on the "Compare To" drop-down, and select "Scored Labels", we get a very interesting chart.
Contingency Table
This is called a Contingency Table, also known as a Confusion Matrix or a Crosstab.  It shows you the distribution of your correct and incorrect predictions.  As you can see, our model is very good at predicting when a person has an Income of "<=50k", but not very good at predicting ">50k".  We could go much deeper into the concept of model validation, but this was an interesting chart that we stumbled upon here.  Let's look at the Evaluation Results by Fold.
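A Contingency Table is really just a count of each (actual, predicted) pair.  Here is a minimal Python sketch using six made-up rows; the counts bear no relation to the real results in the chart.

```python
from collections import Counter

def contingency_table(actual, predicted):
    """Count how often each (actual, predicted) combination occurs."""
    return Counter(zip(actual, predicted))

# Made-up labels, for illustration only.
actual = ["<=50k", "<=50k", "<=50k", ">50k", ">50k", "<=50k"]
predicted = ["<=50k", "<=50k", ">50k", "<=50k", ">50k", "<=50k"]

table = contingency_table(actual, predicted)
correct = sum(count for (a, p), count in table.items() if a == p)
```

The pairs where the two labels match are the correct predictions; everything else is a mistake of one kind or the other.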
Evaluation Results by Fold
This shows you a bunch of statistics about your Cross-Validation.  The purpose of this post was to talk about the Two-Class Averaged Perceptron, so we won't spend much time here.  However, don't be surprised if we make a full-length post about this in the future because there is a lot of information here.

We hope that this post sparked as much excitement in you as it did in us.  We're really starting to see just how much awesomeness is packed into Azure Machine Learning Studio; and we're so excited to keep digging.  Thanks for reading.  We hope you found this informative.

Monday, October 10, 2016

Azure Machine Learning: Edit Metadata, Clean Missing Data, Evaluate Probability Function, Select Columns and Compute Linear Correlation

Today, we're going to look at Sample 2: Dataset Processing and Analysis: Auto Imports Regression Dataset in Azure ML.  In our previous post, we took our first look inside Azure ML using a simple example.  Now, we're going to take it a step further with a larger experiment.  Let's start by looking at the whole experiment.
Sample 2
This definitely looks daunting at first glance.  So, let's break it down into smaller pieces.
Data Import
As you can see, this phase is made up of an HTTP Data Import, which we covered in the previous post.  However, this Edit Metadata tool is interesting.  Let's take a look inside.
Edit Metadata
In most tools, you don't "replace" columns by editing metadata.  For instance, if you were using a data manipulation tool like SSIS, you would have to use one tool to create the new columns with the new data types, then use another tool to remove the old columns.  This is not only cumbersome from a coding perspective, but it also performs inefficiently because you have to carry those old columns to another tool after you no longer need them.

The Edit Metadata tool, on the other hand, does both of these in one.  It allows you to rename your columns, change data types, and even change them from categorical to non-categorical.  There's also a long list of options in the "Fields" box to choose from.  We're not sure what any of these options do, but that sounds like a great topic for another post!  Alas, we're veering off-topic.

One of the major things we don't like about this tool is that it sets one set of changes for the entire list of columns.  That means that if you want multiple sets of changes, you need to use the tool multiple times.  Fortunately, this experiment only uses it twice.  Before we move on, let's take a look at the data coming out of the second Edit Metadata tool.
Data Before Analysis
The two Edit Metadata tools altered columns 1, 2 and 26, giving them names and making them numeric.  This leaves us with one huge question.  Why did they not rename the rest of the columns?  Do they not intend on using them later or were they just showcasing functionality?  Guess we'll just have to find out.

Let's move on to the left set of tools.
Left Set
This set starts off with a Clean Missing Data tool.  Let's see what it does.
Clean Missing Data (Left)
As you can see, this tool takes all numeric columns, and substitutes 0 any time that the value is missing.  The process of replacing missing values is called Imputation and it's a big deal in the data science world.  You can read up on it here.  This tool also has the option to generate a missing value indicator column.  This means that any row where a value was imputed would have a value of 1 (or TRUE) and all other rows would have a value of 0 in the column.  While we would love to go in-depth about imputation, it's a subject we'll have to reserve for a later post.
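Here is a small Python sketch of zero-substitution together with a missing value indicator column.  The column values are made up, and None is used as a stand-in for a missing value.

```python
def clean_missing(values, replacement=0):
    """Replace missing values and flag which rows were imputed."""
    cleaned = [replacement if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return cleaned, was_missing

# A made-up numeric column with two missing values.
column = [13495, None, 16500, None, 17450]
cleaned, indicator = clean_missing(column)
```

Keeping the indicator column around lets a downstream model tell a genuine 0 apart from a 0 that was filled in.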

In our opinion, the coolest part about this tool has nothing to do with its goal.  It has one of the best column selection interfaces we've ever seen.  It allows you to programmatically add or remove columns from your dataset.  Looking at some of the other tools, this column selection interface pops up in quite a few of them.  This makes us very happy.
Select Columns
We'll skip over the Summarize Data tool, as that was covered in the previous post.  Let's move on to the Evaluate Probability Function tool.
Evaluate Probability Function (Left)
There are a couple of things to note about this tool.  First, it lets you pick from a very large list of distributions.
Distributions
It chose to use the Normal Distribution, which is what people are talking about when they say "Bell-Shaped Curve".  It's definitely the most commonly used distribution among beginner data scientists.  Next, it lets you select which method you would like to use.
Method
The tool offers three methods, the Probability Density Function (PDF), Cumulative Distribution Function (CDF), and Inverse Cumulative Distribution Function (Inverse CDF).  If you want to look these up on your own, you can use the following links: PDF, CDF, InverseCDF.

You can get entire degrees just by learning these three concepts and we won't even attempt to explain them in a sentence.  Simply put, if you want to see how likely (or unlikely) an observation is, use the CDF.  We'll leave the other two for you to research on your own.  Let's move on to the Select Columns tool.
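If you'd like to play with these three functions outside of Azure ML, Python's standard library can evaluate all of them for the Normal Distribution.  This sketch uses a standard normal (mean 0, standard deviation 1), which is our choice for illustration and not necessarily what the experiment uses.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

density_at_zero = standard_normal.pdf(0.0)         # height of the bell curve at 0
prob_below_1_96 = standard_normal.cdf(1.96)        # P(X <= 1.96)
cutoff_for_97_5 = standard_normal.inv_cdf(0.975)   # value with 97.5% of the curve below it
```

Notice that the Inverse CDF undoes the CDF: feeding it a probability returns the value that sits at that point in the distribution.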

Select Columns (Left)
This tool is about as simple as it comes.  You tell it which columns you want (using that awesome column selector) and it throws the rest away.  Let's move on to the final tool, Compute Linear Correlation.
Compute Linear Correlation (Left)
This is another simple tool.  You give it a data set with a bunch of numeric columns, it spits out a matrix of Pearson correlations.
Linear (Pearson) Correlations (Left)
These are the traditional correlations you learned back in high school.  A correlation of 0 means that there is no linear relationship and a correlation of 1 or -1 means there is a perfect linear correlation.
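For reference, here is a minimal Python sketch of the Pearson correlation and the matrix built from it.  The three columns are made up to show the extreme cases.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

def correlation_matrix(columns):
    """Correlate every column with every other column."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up columns, for illustration only.
columns = {
    "x": [1, 2, 3, 4, 5],
    "minus_x": [-1, -2, -3, -4, -5],        # perfect negative linear relationship
    "distance_squared": [4, 1, 0, 1, 4],    # clearly related to x, but not linearly
}
matrix = correlation_matrix(columns)
```

The last column is a good reminder that a correlation of 0 only rules out a linear relationship; distance_squared is completely determined by x, yet their correlation is 0.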

We were planning on moving on to the other legs of this experiment.  However, they are just slightly altered versions of what we went through.  This is an interesting sample for Microsoft to publish, as it doesn't seem to have any real analytical value.  We can't imagine using this in any sort of investigative or analytic scenario.  What is it good for then?  Learning Azure ML!  This is exactly what we used it for.  It showed us some neat features and prompted a few good follow-up posts to get more in-depth on some of these tools.  Hopefully, it sparked some good ideas in you guys too.  Thanks for reading.  We hope you found this informative.
