Breaking BI: Predictive Analytics in Tableau Part 4: Logistic Regression

Monday, January 20, 2014

Today, we're going to talk about performing Logistic Regression using Tableau 8.1's R functionality. Logistic Regression is very similar to Linear Regression, which we saw in the previous posts in this series. However, Logistic Regression is designed to predict binary (Yes/No, 1/0) outcomes. A very simple example is "Will this customer buy our product if we advertise to them?" For this exercise, we will use the ubiquitous AdventureWorks data set from Microsoft. If you've ever seen a Microsoft Data Mining demo, you've seen this data set. Let's start by looking at our data.

Customer Demographics

As you can see, we have quite a bit of information about our customers, as well as whether or not they purchased a bike from us in the past. Now, let's look at a logistic model with only one predictor so that we can understand how it works.

Purchased Bike (Predicted by Age)

As you can see, this code is extremely similar to the code for creating a linear regression model. The only differences are that this model uses a more general function, glm(), and an extra parameter, family = binomial(logit), which tells glm() to fit a logistic model. Now, let's see what the predictions look like.
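To make the pattern concrete outside of Tableau, here is a minimal sketch of this kind of model in plain R. It is not the calculated field from our workbook: the data are simulated, and the names age, purchased, and fit are placeholders we made up for illustration rather than AdventureWorks columns.

```r
# Simulated stand-in for the customer data (hypothetical, not AdventureWorks).
set.seed(42)
n <- 500
age <- round(runif(n, 25, 80))
# Simulate a 1/0 purchase flag whose probability falls as age rises.
purchased <- rbinom(n, size = 1, prob = plogis(2.5 - 0.05 * age))

# Logistic regression: glm() with a binomial family and a logit link.
fit <- glm(purchased ~ age, family = binomial(link = "logit"))

# One fitted probability per customer, each strictly between 0 and 1.
prob <- fitted(fit)
head(prob)

# Inside a Tableau calculated field, R code like this is passed as a string
# to SCRIPT_REAL(), and the Tableau fields arrive in R as .arg1, .arg2, ...
```

The only lines that matter for the model itself are the glm() call and the family argument; everything else is scaffolding to give it some data.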

Predictions by Age

Some of you will immediately ask why this function returns decimals when we asked it to predict a Yes/No response. Strictly speaking, logistic regression does not predict a Yes/No response; it predicts the probability of a particular response. In other words, it tells us how likely this person is to buy a bike. It's up to us to decide how we want to use these probabilities. Let's look at these predictions in another way.
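If you want to see where those decimals come from, the sketch below may help. It assumes simulated data and made-up names (customers, new_customers), and it shows that the model first produces a score on the log-odds scale, which the logistic function then squeezes into a probability between 0 and 1.

```r
# Simulated data again (hypothetical columns, not AdventureWorks).
set.seed(1)
customers <- data.frame(age = round(runif(200, 25, 80)))
customers$purchased <- rbinom(200, 1, plogis(2.5 - 0.05 * customers$age))

fit <- glm(purchased ~ age, data = customers,
           family = binomial(link = "logit"))

# Three hypothetical customers we want predictions for.
new_customers <- data.frame(age = c(30, 50, 70))

# type = "link" returns the linear predictor (log-odds);
# type = "response" returns the probability of a purchase.
log_odds <- predict(fit, newdata = new_customers, type = "link")
prob     <- predict(fit, newdata = new_customers, type = "response")

# The two are connected by the logistic function.
manual <- 1 / (1 + exp(-log_odds))
print(cbind(new_customers, log_odds, prob, manual))
```

The prob and manual columns should agree, which is all "logit" really means here.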

Predictions by Age (Scatterplot)

As you can see, the probability of buying a bike decreases as the customer gets older. This is an important, and not very surprising, discovery. Now, how do we turn these probabilities into actual predictions? That's up to us! An easy way is to say "If the chance is greater than 50%, we say they will buy. Otherwise, we say they won't." Let's see what this gets us.
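The 50% rule described above can be sketched in a couple of lines of R. As before, the data are simulated and the names (customers, cutoff, predicted) are our own placeholders. The cutoff is kept in a variable because 0.5 is a choice, not something the model dictates.

```r
# Simulated data (hypothetical columns, not AdventureWorks).
set.seed(7)
customers <- data.frame(age = round(runif(300, 25, 80)))
customers$purchased <- rbinom(300, 1, plogis(2.5 - 0.05 * customers$age))

fit <- glm(purchased ~ age, data = customers,
           family = binomial(link = "logit"))

# Probability of a purchase for each customer.
customers$prob <- fitted(fit)

# Turn probabilities into Yes/No labels using a cutoff we pick ourselves.
cutoff <- 0.5
customers$predicted <- ifelse(customers$prob > cutoff, "Yes", "No")
head(customers)
```

Lowering the cutoff labels more customers as buyers, which may be what you want if a missed sale costs more than a wasted advertisement.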

Predictions by Age (Classified)

This procedure doesn't seem to be very accurate. Perhaps it's because we're only giving it one predictor. Let's throw the rest of our variables in there and see if it gets better.
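One way to put a number on "accurate" is to work in plain R, outside of Tableau, and count how often the classified predictions match what actually happened. The sketch below does this for a one-predictor model and for a model with every available predictor. The data are simulated, and the extra columns (income, cars) and the accuracy() helper are hypothetical names we invented for the example.

```r
# Simulated data with a few predictors (hypothetical, not AdventureWorks).
set.seed(123)
n <- 1000
customers <- data.frame(
  age    = round(runif(n, 25, 80)),
  income = round(runif(n, 10, 150)),        # in thousands, made up
  cars   = sample(0:4, n, replace = TRUE)
)
customers$purchased <- rbinom(
  n, 1,
  plogis(1.5 - 0.05 * customers$age + 0.02 * customers$income -
           0.4 * customers$cars)
)

# Share of customers whose classified prediction matches the actual outcome.
accuracy <- function(fit, actual, cutoff = 0.5) {
  predicted <- as.integer(fitted(fit) > cutoff)
  mean(predicted == actual)
}

# Model with age only, and model with every other column as a predictor.
fit_age <- glm(purchased ~ age, data = customers,
               family = binomial(link = "logit"))
fit_all <- glm(purchased ~ ., data = customers,
               family = binomial(link = "logit"))

acc_age <- accuracy(fit_age, customers$purchased)
acc_all <- accuracy(fit_all, customers$purchased)
print(c(age_only = acc_age, all_predictors = acc_all))

# Confusion matrix for the full model: actual outcomes vs. predictions.
confusion <- table(actual = customers$purchased,
                   predicted = as.integer(fitted(fit_all) > 0.5))
print(confusion)
```

Keep in mind that this measures accuracy on the same rows the model was fit to, so it will tend to flatter the model compared with truly new customers.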

Predictions (Classified)

These predictions are much better, but still not as accurate as we'd like. Unfortunately, we couldn't find an easy way to look at these predictions in aggregate within Tableau using this method. So far, it seems that the new R implementation is really good for generating predictions. However, it falls a little short when it comes to examining the model. Fortunately, there's even more we could do here. We could try a different model for this data, such as an artificial neural network or a Bayesian model. Maybe one of our readers can find a really neat way to display this data that sums it up nicely. The rest is up to your imagination. We hope that you found this informative. Thanks for reading.

Brad Llewellyn
Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn

8 comments:

Matt Slaughter, January 22, 2014 at 12:23 PM
Nice write-up, Brad! I am struggling with how to use Tableau and R to build a predictor model from known (training) data and then apply the model to the unknown data. Have you come across an easy way to do that?
Jim Gibbon, November 15, 2014 at 1:25 PM
Hi Brad,
Do you know if the arguments in an R script can come from multiple data sources? I created a large crosstab view to make sure all my fields were aggregating correctly, but to do so I blended a lot of different sources (we'll eventually have a single Oracle table underlying the model, but not yet). I now get an error that I didn't see when running models using just one data source:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Any clue if this might be tied to using blended data sources?
Unknown, January 13, 2015 at 12:12 PM
Hi Brad

Txs for the blog posts. those are v-helpful

Can you kindly point me to the exact location of the AdventureWorks data set version you are using or, better, the Tableau workbook for this Part 4 episode?

txs v-much

Jimmy

About Me

Brad is a Service Engineer on Microsoft's FastTrack for Azure Team in Charlotte, NC. Brad helps individuals and organizations leverage Analytics and Azure to revolutionize themselves and their industries. He has an M.S. in Statistics from the University of South Carolina, MCSE Certification in Data Management and Analytics, MCSE Certification in Cloud Platform and Infrastructure and various MCSA Certifications in Business Intelligence and Advanced Analytics. Brad is an active blogger at breaking-bi.blogspot.com. He is also an organizer for the Charlotte BI Group, a local PASS chapter in Charlotte, NC. You can connect with him on LinkedIn at https://www.linkedin.com/in/bradllewellyn and on Twitter @BreakingBI.