Monday, January 6, 2014

Predictive Analytics in Tableau Part 2: Linear Regression with Multiple Regressors


Today, we will talk about using Tableau 8.1's R functionality to perform predictive analysis via Multiple Regression.  In our previous post in the series, Predictive Analytics in Tableau Part 1: Simple Linear Regression, we talked about Simple Linear Regression.  Now, we're moving a step up and adding multiple variables to the mix.  However, we're going to keep it simple for now and keep all of the regressors at degree 1.  For those of you without advanced mathematics, which is probably most of you, a variable of degree 1 is linear, i.e. a straight line.  When the degrees get higher, the function get more curves.  If you want to know more, you can check out this article.  For this analysis, we will use the same data set as in the previous post, which can be found here.

To begin, let's look back at our scatterplot matrix again to see what our data looks like.
Scatterplot Matrix
Initially, we can just throw everything against the wall and see what comes out.  We don't particularly care about knowing what the relationship is, we just want some predictions.  In these simple scenarios, it is safe to assume that more data is always better.  However, it's not always the case with real data.  But, we'll deal with that in a later post.  For now, let's just put everything in the model.  The code is too long to fit in one screenshot; so I have pasted it below:

SCRIPT_REAL( "

    ## Defining Variables

    cons <- .arg1
    crud <- .arg2
    djia <- .arg3
    fore <- .arg4
    gnp <- .arg5
    inte <- .arg6
    purc <- .arg7

    ## Fitting the Model

    fit <- lm( cons ~ crud + djia + fore + gnp + inte + purc )
    fit$fitted
"
, SUM( [CONSUMER] ), SUM( [CRUDE] ), SUM( [DJIA] ), SUM( [FOREIGN] ),
SUM( [GNP] ), SUM( [INTEREST] ), SUM( [PURCHASE] ) )

Now, let's see the results.
Predicted Consumer Debt by Year (Text Table)
Just like in our previous post, this text table isn't very easy to read.  However, if you look closely, you will see that the predictions are significantly closer since we included all of the variables.  Now, we're stuck with another dilemma.  How do we see the results visually?  We can't have a seven-dimensional scatterplot.  Well, there are a couple of different ways.  First, let's look at what's called a Residual vs. Predicted plot.

Mathematically, the residuals are the differences between the actual value and the predicted value.  In the business world, you'll often hear this called a "Delta."
Consumer (Residual)
Now, imagine that our model fit our data extremely well.  We wouldn't expect that the model fit perfectly.  However, we would expect the model to remove most of the "systematic" variation within the data, leaving only random noise.  This is precisely what this plot is designed to see.
Residual vs. Predicted

Reading these types of charts is more art than science.  But, you can ask yourself one question, "Does the data form any significant pattern?"  We would say no.  Therefore, we believe that our model is a good fit for this data.  Now, let's visualize the data in its original context, by year.  Remember in the last post where we talked about "Prediction Intervals"?  We can make those here as well, using almost identical code.  The code for Consumer (Predicted Lower) is pasted below:

SCRIPT_REAL( "

    ## Defining Variables

    cons <- .arg1
    crud <- .arg2
    djia <- .arg3
    fore <- .arg4
    gnp <- .arg5
    inte <- .arg6
    purc <- .arg7

    ## Fitting the Model

    fit <- lm( cons ~ crud + djia + fore + gnp + inte + purc )

    ## Creating the Prediction Interval

    dat <- data.frame(cbind(cons,crud,djia,fore,gnp,inte,purc))
    predict(fit, dat, interval = 'prediction')[,2]
"
, SUM( [CONSUMER] ), SUM( [CRUDE] ), SUM( [DJIA] ), SUM( [FOREIGN] ),
SUM( [GNP] ), SUM( [INTEREST] ), SUM( [PURCHASE] ) )

To find the upper bound, you simply need to change the ,2 to a ,3.  Finally, let's plot our data.
Predicted Consumer Debt by Year (Line Chart)
As you can see, the bounds fit the actual data (blue line) very tightly and follow it as it increases over time.  Now, some of the more knowledgeable readers might say, "Multiple Regression requires uncorrelated observations, not time series data!"  You would be correct.  However, our goal here was simply to predict Consumer Debt.  We don't care about using the model to show correlation/causation between these variables.  We'll leave that to the econometricians.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Associate Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn

18 comments:

  1. Thanks for this series. I do have a question however. When writing the calc for the prediction intervals, what do the[,2] and [,3] represent? It obviously gives the lower and upper interval, but what specifically does it mean?

    Thanks,

    ReplyDelete
    Replies
    1. That's a really good question. The predict function returns a table with the following columns: Prediction, Lower Bound, and Upper Bound. Therefore, since we need to pull out one column at a time, we give it a column number to pull out. We could have also used [,1], but that would give us that same values we get when we use fit$fitted.

      Delete
  2. Good day Sir,

    I am new to the R language, I just started to educate myself with it just this year. I was trying to perform a linear model with multiple variables with my Tableau. My syntax is:

    SCRIPT_REAL( ”

    ## Defining Variables

    [GVA]<- .arg1,
    [Emp]<- .arg2,
    [Surv]<- .arg3,
    [CP]<- .arg4,

    ## Fitting the Model

    fit <- lm( GVA ~ Emp + CP + Surv)
    fit$fitted
    "
    ,SUM( [GVA]), SUM([CP]), SUM([Emp]), SUM([Surv]))

    but I received an error saying:

    Error in base::parse(text = .cmd) : :5:5: unexpected ‘[‘
    4:
    5: [
    ^
    What seems to be the problem? Thank you so much.

    ReplyDelete
    Replies
    1. Edward,

      Thanks for commenting! As far as I can see, there are two issue with your code. First, in the ## Defining Variables section, you shouldn't use commas to end the lines. R automatically executes a line when you "Carriage Return". Second, the order of the variables in the Define Variables section need to be the same as the order in the last line of the code where you have all of the SUM(). Otherwise, you will get false results.

      Thanks!

      Delete
  3. Thanks very much for the informative session. I'm having a bit of a problem. If I would like to see each of the predicted y value, how would I be able to do that? It seems like Tableau is putting everything into the SUM, and if I don't want to see it by year, rather, by company or by field, it stops working?

    ReplyDelete
    Replies
    1. Thanks for commenting! Your problem is likely due to the "Compute Using". The R interface is actually just a table calculation. So, when you change your dimension from Year to Company, you will probably need to change your Compute Using as well. Hopefully this post will help you.

      http://breaking-bi.blogspot.com/2013/07/introduction-to-table-calculations.html

      Delete
  4. Thank you Brad. Your blog was helpful. is it possible to view a summary of the linear regression? I am thinking the R function "summary(lm)" in tableau.

    ReplyDelete
    Replies
    1. In a simple way, no. The data you get back from R has to be 1-1 with the data you send in. So, if you send in 20 rows of data (considering a tabular), you will get 20 values back. You can get each value back individually if you'd like....and display them on a chart. But you can't see the R output within R.

      Delete
  5. Hi Brad,

    Your blog is awesome, it's helping me a lot.

    I'm having issues with the multiple regression procedure in Tableau.

    I am using the following code:

    SCRIPT_REAL("
    Score <- .arg1
    A <- .arg2
    B <- .arg3
    C <- .arg4
    D <- .arg5
    E <- .arg6
    F <- .arg7
    G <- .arg8

    fit <- lm( OSAT ~ A + B + C + D + E + F + G )
    fit$fitted
    "
    , avg([Score]), avg([A]), avg([B]), avg([C]), avg([D]), avg([E]), avg([F]), avg([G]))

    What I actually need is the coefficient score and the P-value for each of those arguments (A to G)

    So I tried changing the fit$fitted for something like this:

    fit$coefficients

    Also, the data frame I am using come from a survey database. It looks like this:

    Respondent # | Score | A | B | C | D | E | F | G|
    #1 | 10 | 9 | 9 | 8 | 7 | 6 | 7 | 5
    #2 | 5 | 4 | 2 | 3 | 5 | 7 | 8 | 6
    ...
    i

    So what I am trying to achieve is to get the coefficients and p-value on a single sheet for all those arguments (A to G) in order to make a scatter plot.

    Thanks a lot


    ReplyDelete
    Replies
    1. Thanks for commenting! You've run into the most frustrating part of the Tabeau/R integration. The data that R returns has have the same length as the data you send to it. For instance, if you send in vector #1, which has 8 values, you must return 8 values. So, if you wanted to return the p-value, you have to do rep( P, 8 ). Does this make sense?

      Delete
  6. Good to know it's the most frustrating part, I thought maybe I was the only one facing this issue!

    I am not sure what you mean by doing rep( P, 8 ). What I am trying to achieve is to get those values (the 8 coefficients and 8 p-values) in a single table in Tableau. The best way would be to get a matrix from R but since we get only a single value I am wondering how I can modify my code in order to get those values into Tableau.

    The visulization I'm looking for is a bubble chart:
    X axis = average Score
    Y axis = Coefficient
    The bubble would be A-B-C-D-E-F-G according to their average score and coefficient.
    And I would filter by p-value > 0.05 to show only the attributes that are statistically significant.

    Tell me if I’m not being clear enough, I can send you an example of what I’m trying to achieve.

    Thanks,

    Gabriel

    ReplyDelete
    Replies
    1. My mistake. I misunderstood your data structure. So, you have a data set with X rows and columns Score, A, B, C, D, E, F, G + some more dimensions. The issue here is that you would traditionally place your ID field (unique ID for each row/observation) onto the Detail Shelf along with AVG( [Score] ), AVG( [A] ), etc. This would mean that your Scatterplot has X points on the chart, not the 7 that you would want in order to plot a point for each Regressor. To simplify the story substantially, what you're asking for is possible, yet extremely complex. It would require an expert knowledge of data structure and Table Calculations as well as a decent amount of time to work through it. After that, it would likely not perform very well given any reasonable amount of data.

      If you still wish to attempt this, I'll give you one piece of advice "The R Script is nothing more than a table calculation, and you can treat it as such."

      Sorry I didn't have better news. If you do figure it out, please post it up here. It would be a great experience for myself and the other readers.

      Cheers,

      Brad Llewellyn

      Delete
    2. All right, thanks a lot.

      If I make it work I will definitely post it here!

      Gabriel

      Delete
  7. How can I show the regression model's coefficients in Tableau? eg. We can use something like summary(fit)$coefficients[,1] in R to show one of the coefficients, but how to show it in Tableau? Thanks!

    ReplyDelete
  8. I keep getting a perfect model fit & my residuals are zero. I have a large data set & purposefully duplicated all variables in one row with two different y values..& it STILL predicts perfectly. I'm obviously doing something wrong. Can you help?

    ReplyDelete
  9. I keep getting a perfect model fit & my residuals are zero. I have a large data set & purposefully duplicated all variables in one row with two different y values..& it STILL predicts perfectly. I'm obviously doing something wrong. Can you help?

    ReplyDelete
  10. I keep getting a perfect model fit & my residuals are zero. what is wrong witn my calculating? thanks

    ReplyDelete
    Replies
    1. When potting residuals, check that your table calculation has the correct "compute using" setting. If may default to Table(Across) when you need it to be the set to the key that identifies each row, which in the example above, is YEAR.

      Delete