When you are creating a model, you typically want to "hold out" a portion of your data to use for testing later. For instance, if we hold out 25% of our data, then we only use the remaining 75% when we created our model. The 75% is known as the "Training Set" because it is used to train the model. The 25% is known as the "Testing Set" because it is used to test the model after it has been trained. First, let's see how you would create these sets.
Consumer vs. Foreign |
Set |
The code is not too complex. Basically, it creates a 0/1 values for each year, with a 25% chance of 1 and 75% chance of 0. Then, it changes the labels on these values to be more readable. Now, let's see what it looks like on our chart.
Consumer vs. Foreign (with Set) |
As you can see, it put 8 observations in our training set, and 4 observations in the testing set. This isn't exactly 25%, but it's only one observation off. You should be aware that Tableau is going to query the R server every time you refresh this chart. This means that the observations in the training set WILL change. This is exactly the way we want it to work. We want our training set to be as similar as possible to our testing set, with the exception of the size. Now, let's see how we could create our regression model using only our training set. The code for "Foreign (Predicted)" is below:
SCRIPT_REAL( "
cons <- .arg1
fore <- .arg2
set <- .arg3
fore.train <- fore[set=='Training']
cons.train <- cons[set=='Training']
fit <- lm( fore.train ~ cons.train )
dat <- data.frame(cbind(fore, cons))
names(dat) <- c( 'fore.train', 'cons.train' )
predict(fit, dat, interval = 'prediction')[,1]
"
, SUM( [CONSUMER] ), SUM( [FOREIGN] ), [Set] )
This code simply uses the training data to create the model, then predicts values for all of the data. Let's see this on our scatterplot.
Consumer vs. Foreign (with Set, Prediction, and Trend) |
This is just like the scatterplots we've been looking at, with a couple of additions. First, we added our Regression model as the grey line. Then, we added Tableau's Trend Line as the black line. Our model was not trained using the testing data, but Tableau's trend line was. So, what's the difference? Not much according to this graph. The lines are very close to one another. This is one piece of evidence saying that this model predicts well. Let's look at it differently.
Consumer vs. Foreign (with Set, Prediction Interval, and Trend) |
Here, we see that not only do all of the training and testing values fall within the bands, but so does the trend line. This is a very good sign. For those of you that are less graphical and more numeric, let's see this on a table.
Consumer vs. Foreign (with Set, Prediction, and % Diff) |
As you can see, our predictions can be off by as much as 35%. You might ask "Why did we talk so highly about the model before when it predicts so poorly here?" The answer is simple. The model did the best it could with the amount of data it was given. When you give a model ten values to look at, how accurate do you expect it to be? The answer is not very. However, in the business world, it's exceedingly rare to have extremely small data sets. For instance, if you have monthly sales values for 5 years, that's 60 values. You could easily hold out 10 or 15 of those and still get a good model.
There are no right answers when it comes to forecasting. Everything is up to interpretation. That's why so many companies put so much money into statisticians to develop accurate forecasts. Even more to that point, perhaps linear regression isn't appropriate for this data set. Maybe you would want to use a time series method, multiple regression, or a bayesian model. Your predictive abilities are limited only by your data and your imagination. We hope you found this informative. Thanks for reading.
There are no right answers when it comes to forecasting. Everything is up to interpretation. That's why so many companies put so much money into statisticians to develop accurate forecasts. Even more to that point, perhaps linear regression isn't appropriate for this data set. Maybe you would want to use a time series method, multiple regression, or a bayesian model. Your predictive abilities are limited only by your data and your imagination. We hope you found this informative. Thanks for reading.
Brad Llewellyn
Associate Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn
Thanks your posts about working with R / Tableau, they have been really useful. One question about validating your model; for linear regression you would often use the R squared value to determine if the model performs well. Is it possible to get the R^2 as a variable in R / Tableau, so you can show it in a dashboard? Thanks!
ReplyDeleteAbsolutely. We'll be dealing with similar concepts in a later post. However, if you want to jump ahead, all you have to do is return the R^2 as a variable in R, then return it as using the rep() function. Unfortunately, I'm not sure off the top of my head how to return the R^2 in the easiest manner.
DeleteThanks for your quick reply, I'll try to use the rep function like you suggested and I'm looking foward to your future posts :-)
Delete