Monday, January 13, 2014

Predictive Analytics in Tableau Part 3: Validating the Accuracy of your Predictions

Today, we will talk about validating the accuracy of your predictions using Tableau 8.1's R functionality.  In the previous posts in this series, we showed you how to create statistical models to predict values for your business.  However, we ignored a very important issue.  A model is supposed to be pretty good at predicting values that you used when you created it.  However, the true test of a model is how well it can predict values it doesn't know the answer to.  For this examination, we will use the same data set and design as we used in the first post in the series, Predictive Analytics in Tableau Part 1: Simple Linear Regression.

When you are creating a model, you typically want to "hold out" a portion of your data to use for testing later.  For instance, if we hold out 25% of our data, then we only use the remaining 75% when we created our model.  The 75% is known as the "Training Set" because it is used to train the model.  The 25% is known as the "Testing Set" because it is used to test the model after it has been trained.  First, let's see how you would create these sets.
Consumer vs. Foreign
This is the data we have for creating our model.  As you can see, we only have 12 observations.  So, we would expect the training set to be 9 observations, and the testing set to be 3 observations.  Let's see how we would create this.
The code is not too complex.  Basically, it creates a 0/1 values for each year, with a 25% chance of 1 and 75% chance of 0.  Then, it changes the labels on these values to be more readable.  Now, let's see what it looks like on our chart.
Consumer vs. Foreign (with Set)
As you can see, it put 8 observations in our training set, and 4 observations in the testing set.  This isn't exactly 25%, but it's only one observation off.  You should be aware that Tableau is going to query the R server every time you refresh this chart.  This means that the observations in the training set WILL change.  This is exactly the way we want it to work.  We want our training set to be as similar as possible to our testing set, with the exception of the size. Now, let's see how we could create our regression model using only our training set.  The code for "Foreign (Predicted)" is below:

    cons <- .arg1
    fore <- .arg2
    set <- .arg3

    fore.train <- fore[set=='Training']
    cons.train <- cons[set=='Training']

    fit <- lm( fore.train ~ cons.train )
    dat <- data.frame(cbind(fore, cons))
    names(dat) <- c( 'fore.train', 'cons.train' )
    predict(fit, dat, interval = 'prediction')[,1]
, SUM( [CONSUMER] ), SUM( [FOREIGN] ), [Set] )

This code simply uses the training data to create the model, then predicts values for all of the data.  Let's see this on our scatterplot.
Consumer vs. Foreign (with Set, Prediction, and Trend)
This is just like the scatterplots we've been looking at, with a couple of additions.  First, we added our Regression model as the grey line.  Then, we added Tableau's Trend Line as the black line.  Our model was not trained using the testing data, but Tableau's trend line was.  So, what's the difference?  Not much according to this graph.  The lines are very close to one another.  This is one piece of evidence saying that this model predicts well.  Let's look at it differently.
Consumer vs. Foreign (with Set, Prediction Interval, and Trend)
Here, we see that not only do all of the training and testing values fall within the bands, but so does the trend line.  This is a very good sign.  For those of you that are less graphical and more numeric, let's see this on a table.
Consumer vs. Foreign (with Set, Prediction, and % Diff)
As you can see, our predictions can be off by as much as 35%.  You might ask "Why did we talk so highly about the model before when it predicts so poorly here?"  The answer is simple.  The model did the best it could with the amount of data it was given.  When you give a model ten values to look at, how accurate do you expect it to be?  The answer is not very.  However, in the business world, it's exceedingly rare to have extremely small data sets.  For instance, if you have monthly sales values for 5 years, that's 60 values.  You could easily hold out 10 or 15 of those and still get a good model.

There are no right answers when it comes to forecasting.  Everything is up to interpretation.  That's why so many companies put so much money into statisticians to develop accurate forecasts.  Even more to that point, perhaps linear regression isn't appropriate for this data set.  Maybe you would want to use a time series method, multiple regression, or a bayesian model.  Your predictive abilities are limited only by your data and your imagination.  We hope you found this informative.  Thanks for reading.

Brad Llewellyn
Associate Data Analytics Consultant
Mariner, LLC


  1. Thanks your posts about working with R / Tableau, they have been really useful. One question about validating your model; for linear regression you would often use the R squared value to determine if the model performs well. Is it possible to get the R^2 as a variable in R / Tableau, so you can show it in a dashboard? Thanks!

    1. Absolutely. We'll be dealing with similar concepts in a later post. However, if you want to jump ahead, all you have to do is return the R^2 as a variable in R, then return it as using the rep() function. Unfortunately, I'm not sure off the top of my head how to return the R^2 in the easiest manner.

    2. Thanks for your quick reply, I'll try to use the rep function like you suggested and I'm looking foward to your future posts :-)