## Monday, October 13, 2014

### Data Mining in Excel Part 24: Logistic Regression

Today, we're going to start talking about the last few algorithms in the Microsoft Data Mining Stack.  These algorithms can only be accessed via the "Add Model to Structure" tool.
*(Image: Add Model to Structure)*
Specifically, this post will be talking about the Logistic Regression algorithm.  Logistic Regression is designed to predict binary (Yes/No, 1/0) values.  To find out more about Logistic Regression, read this.  However, Microsoft's algorithm expands the scope of the procedure to allow it to predict more than just two values.  It is closely related to the Decision Tree algorithm, which is where we're going to start.  In a previous post, we built a classification model for predicting Purchased Bike.  Let's add another model to that structure using the Logistic Regression algorithm and see what we get.
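At its core, logistic regression passes a weighted sum of the inputs through the sigmoid function to get a probability between 0 and 1, then learns the weights from the training data.  Here's a rough sketch of the idea in Python; the toy data, learning rate, and epoch count are invented for illustration and have nothing to do with the add-in's internals.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Fit weights by simple stochastic gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of log-loss w.r.t. the weighted sum
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy rows: [income in $10k, cars owned]; 1 = purchased a bike, 0 = did not.
X = [[3, 0], [4, 1], [9, 0], [10, 1], [2, 4], [3, 3]]
y = [1, 1, 1, 1, 0, 0]

w, b = train_logistic(X, y)
# Probability that a high-income, zero-car customer buys a bike.
prob = sigmoid(sum(wj * xj for wj, xj in zip(w, [8, 0])) + b)
print(round(prob, 2))
```

The trained weights play the same role as the coefficients a logistic regression model reports: a positive weight on an input pushes the prediction toward "Yes," a negative one toward "No."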
*(Image: Select Structure)*
The first step is to select the "Classify Purchased Bike" structure.
*(Image: Select Algorithm)*
Next, we need to select the "Microsoft Logistic Regression" algorithm.  Let's see what kind of parameters we have.
*(Image: Parameters)*
Compared to the other algorithms, Logistic Regression has far fewer parameters.  In fact, none of the parameters alter the methodology behind how the model is built; they all affect what data the model sees while it is being trained.  For more information on these parameters, read this.  Let's move on.
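To give a feel for what "affecting the data the model sees" means, here's a rough sketch of a holdout split, the kind of thing a holdout percentage and seed parameter control.  The function name, signature, and defaults below are our own, invented for illustration; they are not part of the add-in.

```python
import random

def holdout_split(rows, holdout_percentage=30, holdout_seed=42):
    """Reserve a percentage of rows for validation; train on the rest.

    The seed makes the split repeatable, so rebuilding the model
    with the same settings holds out the same rows.
    """
    rng = random.Random(holdout_seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * holdout_percentage // 100
    return shuffled[cut:], shuffled[:cut]  # (training, holdout)

train, holdout = holdout_split(list(range(100)), holdout_percentage=30)
print(len(train), len(holdout))
```

Changing the percentage or seed changes which rows the algorithm trains on, but not the math the algorithm performs, which is exactly the situation with this algorithm's parameters.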
*(Image: Select Columns)*
Now, we need to select which columns we will use for which purposes.  Since this is a simple question, we want to predict Purchased Bike using every column except ID (called _RowIndex here for some reason) and Purchased Bike itself.
*(Image: Create Model)*
Finally, we create the model inside the existing structure.  It's interesting to note that this algorithm does not allow drillthrough.  You'll understand why as soon as you see the Browse window.
*(Image: Browse)*
This is the only option we have for viewing this model.  Many of you may recognize this as a Discrimination Report.  This report shows us which input values seem to correspond with certain output values.  For instance, we see that people with 5 children, 4 cars, or 3 cars don't usually buy bikes.  However, the odds of selling to customers who have high incomes or who live in the Pacific region are much higher.  Obviously, there's nowhere to click here to drill through, which is why the option was not selectable before.
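Conceptually, a discrimination report boils down to comparing how often each input value appears among the "Yes" rows versus the "No" rows.  Here's a minimal sketch of that idea; the column names, toy rows, and smoothing are our own for illustration, not how Microsoft's implementation computes its scores.

```python
from collections import Counter

def discrimination_scores(rows, attribute, outcome):
    """Score each attribute value by how strongly it favors "Yes" over "No".

    Scores above 1 favor "Yes", below 1 favor "No" (a simple
    smoothed probability ratio).
    """
    yes = Counter(r[attribute] for r in rows if r[outcome] == "Yes")
    no = Counter(r[attribute] for r in rows if r[outcome] == "No")
    n_yes = sum(yes.values())
    n_no = sum(no.values())
    scores = {}
    for value in set(yes) | set(no):
        p_yes = (yes[value] + 1) / (n_yes + 2)  # Laplace smoothing avoids
        p_no = (no[value] + 1) / (n_no + 2)     # division by zero counts
        scores[value] = p_yes / p_no
    return scores

rows = [
    {"Cars": 0, "Purchased Bike": "Yes"},
    {"Cars": 0, "Purchased Bike": "Yes"},
    {"Cars": 1, "Purchased Bike": "Yes"},
    {"Cars": 4, "Purchased Bike": "No"},
    {"Cars": 4, "Purchased Bike": "No"},
]
print(discrimination_scores(rows, "Cars", "Purchased Bike"))
```

In this toy data, owning 0 cars scores well above 1 (favors buying) and owning 4 cars scores well below 1, mirroring the bars you see in the report.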

Our customization options lie in the upper-right corner of the window.  Here, we can select which output variable we want to look at, assuming we have more than one, and which values within that variable we want to compare.  In our case, we have one variable with two values, so the choice is pretty simple.  We can also choose to look at a subset of our population by applying filters in the upper-left corner of the window.

You might be wondering why anyone would use this algorithm when the Decision Tree algorithm seems to be much more informative.  In fact, the Decision Tree algorithm is built with some of the same logic as the Logistic Regression algorithm, just in a more complex manner, and this algorithm doesn't really give us any parameters to tweak to our liking.  The answer is performance: if you were working with enormous data sets, the Decision Tree algorithm may have significant performance issues.  So, one methodology for professional data miners is to build many candidate models using different parameters and different algorithms, then compare the models to see which gives better results with less work.  For instance, if your Decision Tree model gives very similar predictions to your Logistic Regression model but takes all day to run, you should consider throwing the Decision Tree model away, or finding a combination of parameters that gives you a better prediction.  When you're dealing with enterprise-wide Data Mining operations, time is money and should be treated as such.

Stay tuned for our next post, where we'll be talking about Naive Bayes.  Thanks for reading.  We hope you found this informative.
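As a postscript, the compare-and-discard workflow above can be sketched in a few lines of Python.  The stand-in "models" below are toys invented for illustration; the point is the bookkeeping, recording both quality and cost for each candidate.

```python
import time

def compare_models(models, X, y):
    """Run each candidate model, recording accuracy and wall time,
    so near-tied models can be judged on cost as well as quality."""
    results = []
    for name, fit_predict in models.items():
        start = time.perf_counter()
        preds = fit_predict(X, y)
        elapsed = time.perf_counter() - start
        accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
        results.append((name, accuracy, elapsed))
    # Best accuracy first; ties broken by lower runtime.
    return sorted(results, key=lambda r: (-r[1], r[2]))

# Toy data and two stand-in "models": a majority-class baseline
# and a simple threshold rule.
X = [1, 2, 8, 9, 3, 7, 6]
y = [0, 0, 1, 1, 0, 1, 1]
models = {
    "baseline": lambda X, y: [max(set(y), key=y.count)] * len(X),
    "threshold": lambda X, y: [1 if x > 5 else 0 for x in X],
}
for name, accuracy, seconds in compare_models(models, X, y):
    print(name, accuracy)
```

If two candidates land at nearly the same accuracy, the cheaper one wins, which is exactly the argument for keeping a fast Logistic Regression model over a slow Decision Tree that predicts no better.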