Breaking BI: Predictive Analytics in Tableau Part 2: Linear Regression with Multiple Regressors

Monday, January 6, 2014

Today, we will talk about using Tableau 8.1's R functionality to perform predictive analysis via Multiple Regression. In our previous post in the series, Predictive Analytics in Tableau Part 1: Simple Linear Regression, we talked about Simple Linear Regression. Now, we're moving a step up and adding multiple variables to the mix. However, we're going to keep it simple for now and keep all of the regressors at degree 1. For those of you without advanced mathematics, which is probably most of you, a variable of degree 1 is linear, i.e. it enters the model as a straight line. As the degree gets higher, the function gets more curves. If you want to know more, you can check out this article. For this analysis, we will use the same data set as in the previous post, which can be found here.

To begin, let's look back at our scatterplot matrix again to see what our data looks like.

Scatterplot Matrix

Initially, we can just throw everything against the wall and see what comes out. We don't particularly care what the relationship is; we just want some predictions. In simple scenarios like this one, it is reasonable to assume that more data is better. That is not always the case with real data, but we'll deal with that in a later post. For now, let's just put everything in the model. The code is too long to fit in one screenshot, so I have pasted it below:

SCRIPT_REAL( "
## Defining Variables
cons <- .arg1
crud <- .arg2
djia <- .arg3
fore <- .arg4
gnp <- .arg5
inte <- .arg6
purc <- .arg7

## Fitting the Model
fit <- lm( cons ~ crud + djia + fore + gnp + inte + purc )

## Returning one fitted value per row
fit$fitted
"
, SUM( [CONSUMER] ), SUM( [CRUDE] ), SUM( [DJIA] ), SUM( [FOREIGN] ),
SUM( [GNP] ), SUM( [INTEREST] ), SUM( [PURCHASE] ) )

Now, let's see the results.

Predicted Consumer Debt by Year (Text Table)

Just like in our previous post, this text table isn't very easy to read. However, if you look closely, you will see that the predictions are noticeably closer to the actual values now that we have included all of the variables. Now, we're stuck with another dilemma. How do we see the results visually? We can't have a seven-dimensional scatterplot. Well, there are a couple of different ways. First, let's look at what's called a Residual vs. Predicted plot.

Mathematically, the residuals are the differences between the actual value and the predicted value. In the business world, you'll often hear this called a "Delta."
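As a quick illustration of that definition, here is a small sketch in plain R using five made-up numbers (not the post's data). It computes actual minus predicted by hand and checks the result against R's built-in residuals() function.

```r
# Tiny made-up example: five actual values and a single regressor.
actual <- c(10, 12, 15, 19, 26)
x <- c(1, 2, 3, 4, 5)

fit <- lm(actual ~ x)
predicted <- fitted(fit)

# Residual = actual value minus predicted value.
res <- actual - predicted

# Should agree with what R reports for the same model.
all.equal(unname(res), unname(residuals(fit)))

# With an intercept in the model, the residuals sum to (numerically) zero.
sum(res)
```

A positive residual means the model under-predicted that point; a negative one means it over-predicted.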

Consumer (Residual)

Now, imagine that our model fit our data extremely well. We wouldn't expect that the model fit perfectly. However, we would expect the model to remove most of the "systematic" variation within the data, leaving only random noise. This is precisely what this plot is designed to see.

Residual vs. Predicted
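To make "systematic variation" a little more concrete, here is a hedged sketch in plain R on synthetic data. It fits a straight line to data that is actually curved, then fits a model that includes the curve, and compares how much structure is left in each set of residuals. The helper pattern_r2 is a made-up name for this example, and regressing residuals on the fitted values is just one rough way to put a number on what the eye looks for in the plot.

```r
# Synthetic data with a truly curved relationship (made up for this demo).
set.seed(1)
x <- seq(1, 20, length.out = 40)
y <- 3 + 0.5 * x^2 + rnorm(40, sd = 2)

# Misspecified model: straight line only.
bad <- lm(y ~ x)

# Better specified model: includes the squared term.
good <- lm(y ~ x + I(x^2))

# Hypothetical helper: share of residual variation that a curve in the
# fitted values can explain. High means a leftover pattern.
pattern_r2 <- function(fit) {
  r <- residuals(fit)
  f <- fitted(fit)
  summary(lm(r ~ f + I(f^2)))$r.squared
}

pattern_r2(bad)   # expected to be high: the residuals still follow a curve
pattern_r2(good)  # expected to be low: mostly noise is left
```

In the plot itself, the first case would show up as a clear U shape, while the second would look like a shapeless cloud.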

Reading these types of charts is more art than science. But, you can ask yourself one question, "Does the data form any significant pattern?" We would say no. Therefore, we believe that our model is a good fit for this data. Now, let's visualize the data in its original context, by year. Remember in the last post where we talked about "Prediction Intervals"? We can make those here as well, using almost identical code. The code for Consumer (Predicted Lower) is pasted below:

SCRIPT_REAL( "
## Defining Variables
cons <- .arg1
crud <- .arg2
djia <- .arg3
fore <- .arg4
gnp <- .arg5
inte <- .arg6
purc <- .arg7

## Fitting the Model
fit <- lm( cons ~ crud + djia + fore + gnp + inte + purc )

## Creating the Prediction Interval
dat <- data.frame(cbind(cons,crud,djia,fore,gnp,inte,purc))

## Column 2 of the result is the lower bound
predict(fit, dat, interval = 'prediction')[,2]
"
, SUM( [CONSUMER] ), SUM( [CRUDE] ), SUM( [DJIA] ), SUM( [FOREIGN] ),
SUM( [GNP] ), SUM( [INTEREST] ), SUM( [PURCHASE] ) )
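The column index in that last line is easier to follow once you see what predict() returns. Here is a small sketch in plain R on synthetic data (the variables y, x1 and x2 are made up for this example). With interval = 'prediction', predict() returns a matrix with three columns named fit, lwr and upr, and since no level is specified the interval uses R's default of 95%.

```r
# Synthetic data, made up for this demo.
set.seed(7)
x1 <- runif(25, 0, 10)
x2 <- runif(25, 0, 5)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(25)
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = dat)

# A matrix with one row per observation and three columns.
pred_int <- predict(fit, dat, interval = 'prediction')
colnames(pred_int)

# Column 1 is the prediction, column 2 the lower bound, column 3 the upper bound.
lower <- pred_int[, 2]
upper <- pred_int[, 3]

head(pred_int)
```

Selecting by name, e.g. pred_int[, 'lwr'], does the same thing and is a little harder to get wrong.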

To find the upper bound, you simply need to change the ,2 to a ,3. Finally, let's plot our data.

Predicted Consumer Debt by Year (Line Chart)

As you can see, the bounds fit the actual data (blue line) very tightly and follow it as it increases over time. Now, some of the more knowledgeable readers might say, "Multiple Regression requires uncorrelated observations, not time series data!" You would be correct. However, our goal here was simply to predict Consumer Debt. We don't care about using the model to show correlation/causation between these variables. We'll leave that to the econometricians. Thanks for reading. We hope you found this informative.

Brad Llewellyn
Associate Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn

18 comments:

Anonymous, January 6, 2014 at 10:48 AM
Thanks for this series. I do have a question however. When writing the calc for the prediction intervals, what do the [,2] and [,3] represent? It obviously gives the lower and upper interval, but what specifically does it mean?

Thanks,
Unknown, May 12, 2014 at 6:04 AM
Good day Sir,

I am new to the R language, I just started to educate myself with it just this year. I was trying to perform a linear model with multiple variables with my Tableau. My syntax is:

SCRIPT_REAL( ”

## Defining Variables

[GVA]<- .arg1,
[Emp]<- .arg2,
[Surv]<- .arg3,
[CP]<- .arg4,

## Fitting the Model

fit <- lm( GVA ~ Emp + CP + Surv)
fit$fitted
"
,SUM( [GVA]), SUM([CP]), SUM([Emp]), SUM([Surv]))

but I received an error saying:

Error in base::parse(text = .cmd) : :5:5: unexpected ‘[‘
4:
5: [
^
What seems to be the problem? Thank you so much.
Anonymous, June 19, 2014 at 11:44 AM
Thanks very much for the informative session. I'm having a bit of a problem. If I would like to see each of the predicted y values, how would I be able to do that? It seems like Tableau is putting everything into the SUM, and if I don't want to see it by year, but rather by company or by field, it stops working?
Anonymous, June 19, 2014 at 5:42 PM
Thank you Brad. Your blog was helpful. Is it possible to view a summary of the linear regression? I am thinking of the R function "summary(lm)" in Tableau.
Anonymous, September 16, 2014 at 2:38 PM
Hi Brad,

Your blog is awesome, it's helping me a lot.

I'm having issues with the multiple regression procedure in Tableau.

I am using the following code:

SCRIPT_REAL("
Score <- .arg1
A <- .arg2
B <- .arg3
C <- .arg4
D <- .arg5
E <- .arg6
F <- .arg7
G <- .arg8

fit <- lm( OSAT ~ A + B + C + D + E + F + G )
fit$fitted
"
, avg([Score]), avg([A]), avg([B]), avg([C]), avg([D]), avg([E]), avg([F]), avg([G]))

What I actually need is the coefficient score and the P-value for each of those arguments (A to G)

So I tried changing the fit$fitted for something like this:

fit$coefficients

Also, the data frame I am using comes from a survey database. It looks like this:

Respondent # | Score | A | B | C | D | E | F | G|
#1 | 10 | 9 | 9 | 8 | 7 | 6 | 7 | 5
#2 | 5 | 4 | 2 | 3 | 5 | 7 | 8 | 6
...
i

So what I am trying to achieve is to get the coefficients and p-value on a single sheet for all those arguments (A to G) in order to make a scatter plot.

Thanks a lot

Anonymous, September 16, 2014 at 7:46 PM
Good to know it's the most frustrating part, I thought maybe I was the only one facing this issue!

I am not sure what you mean by doing rep( P, 8 ). What I am trying to achieve is to get those values (the 8 coefficients and 8 p-values) in a single table in Tableau. The best way would be to get a matrix from R but since we get only a single value I am wondering how I can modify my code in order to get those values into Tableau.

The visualization I'm looking for is a bubble chart:
X axis = average Score
Y axis = Coefficient
The bubbles would be A-B-C-D-E-F-G, placed according to their average score and coefficient.
And I would filter by p-value < 0.05 to show only the attributes that are statistically significant.

Tell me if I’m not being clear enough, I can send you an example of what I’m trying to achieve.

Thanks,

Gabriel
Anonymous, November 3, 2015 at 7:24 PM
How can I show the regression model's coefficients in Tableau? E.g., we can use something like summary(fit)$coefficients[,1] in R to show the coefficient estimates, but how do we show them in Tableau? Thanks!
Anonymous, November 19, 2015 at 12:57 AM
I keep getting a perfect model fit & my residuals are zero. I have a large data set & purposefully duplicated all variables in one row with two different y values..& it STILL predicts perfectly. I'm obviously doing something wrong. Can you help?
Red, December 11, 2015 at 11:42 AM
I keep getting a perfect model fit & my residuals are zero. What is wrong with my calculation? Thanks

About Me

Brad is a Service Engineer on Microsoft's FastTrack for Azure Team in Charlotte, NC. Brad helps individuals and organizations leverage Analytics and Azure to revolutionize themselves and their industries. He has an M.S. in Statistics from the University of South Carolina, MCSE Certification in Data Management and Analytics, MCSE Certification in Cloud Platform and Infrastructure and various MCSA Certifications in Business Intelligence and Advanced Analytics. Brad is an active blogger at breaking-bi.blogspot.com. He is also an organizer for the Charlotte BI Group, a local PASS chapter in Charlotte, NC. You can connect with him on LinkedIn at https://www.linkedin.com/in/bradllewellyn and on Twitter @BreakingBI.