## Monday, February 17, 2014

### Predictive Analytics in Tableau Part 8: ARIMA Time Series

Today, we will talk about creating ARIMA time series models using Tableau 8.1's new R functionality.  In layman's terms, an ARIMA model uses three different numeric parameters to make varying types of time series models.  The ARIMA family is one of the most researched and respected families in the field of Time Series Analysis.  It is also used by a number of automated prediction tools like SQL Server Analysis Services.  As usual, we will use the Superstore Sales sample data set from Tableau.

The most important part of creating an ARIMA model is choosing the parameters.  The parameters are as follows:

AR  - Autoregressive
I      - Integrated
MA - Moving Average

We won't delve heavily into what each of these parameters does.  However, we will talk about methods for selecting them.  First, let's take a look at our data.

 Sales by Month
Now, let's look at choosing the AR parameter.  In order to do this, we need to look at the Partial Autocorrelation Function (PACF).  Just like with the previous posts, almost all of the code is used to create the time series.  Only the last line is used to give us the PACF values.  The full code can be found in the appendix at the end of the post.

rep(pacf(timeser, plot=FALSE)\$acf,5)[1:len.orig]

Now, let's look at these values.
 Sales by Month (PACF)
The question we want to ask is "Starting at the beginning, how many consecutive values are above .25 or below -.25?"  We can easily accomplish this with colors.
 AR Colors
 Sales by Month (PACF with Color)
We can see that there are no values above .25 or below -.25.  So, there doesn't seem to be an AR component to this model.  Next, let's move on to the MA component using a very similar technique.  Moving Average components can be analyzed using the Autocorrelation Function (ACF).  Here's the last line of code.

rep(acf(timeser, plot=FALSE)\$acf,5)[1:len.orig]

Now, let's look at the values.
 Sales by Month (ACF with Color)
The first value is always guaranteed to be 1 and should be ignored.  We see that the values after one aren't really worth noting.  Therefore, the MA component in the model is 0 and the AR component is 0.  Now, let's look at the Integrated portion of the model.  R has a nice built-in function called ndiffs() that tells us what the I component should be.  Here's the last line of code.

ndiffs(timeser)

Now, let's see the value.
 Sales by Month (I Component)
Here, we see that our I component should be zero.  This leaves us with a bit of an interesting conundrum.  An ARIMA(0,0,0) model is basically useless.  It is guaranteed always return the mean (average) value.  So, this means that the ARIMA family is not a good fit for this data.  Just for kicks, let's swap this data out for another data set that does fall in the ARIMA family.  So, let's look at a data set with monthly Unemployment values for the US.
 Unemployment by Month
Now, let's look at the ARIMA components.
 Unemployment by Month (ARIMA Components)
We can see that there are 3 consecutive AR values above .25 or below -.25, and I of 0, and 2 MA values (the first of which we ignore).  So, our model should be ARIMA(3,0,1).  Now, let's see what our forecasts look like.
 Unemployment by Month (Forecast)
We can see that the ARIMA model seems to be pretty good at predicting the up-and-down motion of the data.  However, the recent rise in unemployment has caused some variability that the model doesn't handle very well.  Fortunately, there are plenty of other families of models out there.  Thanks for reading.  We hope you found this informative.

Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com

EDIT: An anonymous user noted an error in the original appendix.  The R variables were defined as

year.orig <- .arg1
month.orig <- .arg2

while our assignment at the end of the segment was defining

ATTR( MONTH( [Order Date] ) ) = .arg1
ATTR( YEAR( [Order Date] ) ) = .arg2

This has now been fixed an all of the code should be valid.  Many thanks to our readers for pointing that out.

Appendix

AR Component

SCRIPT_REAL("
library(forecast)

## Creating vectors

hold.orig <- .arg4
len.orig <- length( hold.orig )
len.new <- len.orig - hold.orig[1]

year.orig <- .arg2
month.orig <- .arg1
sales.orig <- .arg3

## Sorting the Data

date.orig <- year.orig + month.orig / 12
dat.orig <- cbind(year.orig, month.orig, sales.orig)[sort(date.orig, index.return = TRUE)\$ix,]
dat.new <- dat.orig[1:len.new,]

## Fitting the Time Series

timeser <- ts(dat.new[,3], start = c(dat.new[1,1], dat.new[1,2]), end = c(dat.new[len.new,1], dat.new[len.new,2]), frequency = 12)
rep(pacf(timeser, plot=FALSE)\$acf,5)[1:len.orig]
",

ATTR( MONTH( [Order Date] ) ), ATTR( YEAR( [Order Date] ) ), SUM( [Sales] ), [Months to Forecast] )

I Component

SCRIPT_REAL("
library(forecast)

## Creating vectors

hold.orig <- .arg4
len.orig <- length( hold.orig )
len.new <- len.orig - hold.orig[1]

year.orig <- .arg2
month.orig <- .arg1
sales.orig <- .arg3

## Sorting the Data

date.orig <- year.orig + month.orig / 12
dat.orig <- cbind(year.orig, month.orig, sales.orig)[sort(date.orig, index.return = TRUE)\$ix,]
dat.new <- dat.orig[1:len.new,]

## Fitting the Time Series

timeser <- ts(dat.new[,3], start = c(dat.new[1,1], dat.new[1,2]), end = c(dat.new[len.new,1], dat.new[len.new,2]), frequency = 12)
ndiffs(timeser)
",

ATTR( MONTH( [Order Date] ) ), ATTR( YEAR( [Order Date] ) ), SUM( [Sales] ), [Months to Forecast] )

MA Component

SCRIPT_REAL("
library(forecast)

## Creating vectors

hold.orig <- .arg4
len.orig <- length( hold.orig )
len.new <- len.orig - hold.orig[1]

year.orig <- .arg2
month.orig <- .arg1
sales.orig <- .arg3

## Sorting the Data

date.orig <- year.orig + month.orig / 12
dat.orig <- cbind(year.orig, month.orig, sales.orig)[sort(date.orig, index.return = TRUE)\$ix,]
dat.new <- dat.orig[1:len.new,]

## Fitting the Time Series

timeser <- ts(dat.new[,3], start = c(dat.new[1,1], dat.new[1,2]), end = c(dat.new[len.new,1], dat.new[len.new,2]), frequency = 12)
rep(acf(timeser, plot=FALSE)\$acf,5)[1:len.orig]
",

ATTR( MONTH( [Order Date] ) ), ATTR( YEAR( [Order Date] ) ), SUM( [Sales] ), [Months to Forecast] )

Forecast (ARIMA(0,0,1))

SCRIPT_REAL("
library(forecast)

## Creating vectors

hold.orig <- .arg4
len.orig <- length( hold.orig )
len.new <- len.orig - hold.orig[1]

year.orig <- .arg2
month.orig <- .arg1
sales.orig <- .arg3

## Sorting the Data

date.orig <- year.orig + month.orig / 12
dat.orig <- cbind(year.orig, month.orig, sales.orig)[sort(date.orig, index.return = TRUE)\$ix,]
dat.new <- dat.orig[1:len.new,]

## Fitting the Time Series

timeser <- ts(dat.new[,3], start = c(dat.new[1,1], dat.new[1,2]), end = c(dat.new[len.new,1], dat.new[len.new,2]), frequency = 12)
fit <- arima(timeser, order=c(0, 0, 0))
c(rep(NA, len.new), forecast(fit)[[4]][1:hold.orig])
",

ATTR( MONTH( [Order Date] ) ), ATTR( YEAR( [Order Date] ) ), SUM( [Sales] ), [Months to Forecast] )

Forecast (ARIMA(3,0,1))

SCRIPT_REAL("
library(forecast)

## Creating vectors

hold.orig <- .arg4
len.orig <- length( hold.orig )
len.new <- len.orig - hold.orig[1]

year.orig <- .arg2
month.orig <- .arg1
sales.orig <- .arg3

## Sorting the Data

date.orig <- year.orig + month.orig / 12
dat.orig <- cbind(year.orig, month.orig, sales.orig)[sort(date.orig, index.return = TRUE)\$ix,]
dat.new <- dat.orig[1:len.new,]

## Fitting the Time Series

timeser <- ts(dat.new[,3], start = c(dat.new[1,1], dat.new[1,2]), end = c(dat.new[len.new,1], dat.new[len.new,2]), frequency = 12)
fit <- arima(timeser, order=c(3, 0, 1))
c(rep(NA, len.new), forecast(fit)[[4]][1:hold.orig])
",

ATTR( MONTH( [Order Date] ) ), ATTR( YEAR( [Order Date] ) ), SUM( [Sales] ), [Months to Forecast] )

1. This comment has been removed by the author.

1. You're Welcome!

2. This comment has been removed by the author.

1. Amelia,

Thanks for commenting! You can do just about anything you want using the R integration in Tableau. The only caveat is that you need someway to visualize it. There's nothing stopping you from using something like adf.test() inside the script and returning the result as a boolean (True/False). Does this help?

3. This comment has been removed by the author.

I am using the same data source (Superstore Sales), Tableau 8.1 but am getting the following errors when I create each of the calculated fields (AR, I, MA)

I copied and pasted in the code directly from your webpage but Im getting an error saying my calculation contains arrors "Cannot mix aggregate and non-aggregate arguments to function"

I am probably doing something very obvious but have been trying to figure this out all morning so am hoping you can help.
P.S - your K-means clustering post was brilliant - I used it with RFM based data and it worked very well so thank you very much for that :)

1. Thanks for the support! The reason you're getting that error is because one of the values you are passing is not an aggregate. At the very end of each section of code, notice how every value is either wrapped in ATTR() or SUM()? Make sure that you are doing the same. If you can't find it, post it here and i'll see what I can do.

Thanks so much for your reply - . Im still stuck on this - I have pasted my code below which I was trying to create the first calculated field named "AR Component". I copied and pasted your code. I initially had 2 errors when I first tried earlier today - [Months to Forecast] was not recognised, but this error disappeared when I added an extra column to the Superstore excel file named Months to Forecast.

Code as follows (thank you very much in advance)

SCRIPT_REAL("
library(forecast)

## Creating vectors

hold.orig <- .arg4
len.orig <- length( hold.orig )
len.new <- len.orig - hold.orig[1]

year.orig <- .arg1
month.orig <- .arg2
sales.orig <- .arg3

## Sorting the Data

date.orig <- year.orig + month.orig / 12
dat.orig <- cbind(year.orig, month.orig, sales.orig)[sort(date.orig, index.return = TRUE)\$ix,]
dat.new <- dat.orig[1:len.new,]

## Fitting the Time Series

timeser <- ts(dat.new[,3], start = c(dat.new[1,1], dat.new[1,2]), end = c(dat.new[len.new,1], dat.new[len.new,2]), frequency = 12)
rep(pacf(timeser, plot=FALSE)\$acf,5)[1:len.orig]
",

ATTR( MONTH( [Order Date] ) ), ATTR( YEAR( [Order Date] ) ), SUM( [Sales] ), [Months to Forecast] )

1. Thanks for posting the code. You missed an important part from a couple of posts before this. Months to Forecast is a integer parameter.

6. Hi,
Thank you for extremely insightful posts on Tableau-R integration.
I have one comment/question related to code in this post.

Since we have:
year.orig <- .arg1
month.orig <- .arg2

shouldn't line with arguments be:

ATTR( YEAR( [Date] ) ), ATTR( MONTH( [Date] ) ), SUM( [Value]), [Months to Forecast] )
rather than month/year in all definitions above.
Thanks,
Jacek.

1. You are absolutely right, I can't believe I didn't catch that. I'm not sure whether that piece of code was actually used for the post. Thanks!

7. Yeah, the data in the screenshots looks OK :-)
Great blog!
Jacek.

8. getting error in r and tableau connectivity

1. Praba,

I'm not quite sure what you're trying to say. Have you set up the R connection inside Tableau? If you haven't, you should read this post.

2. Praba, try this:

1-Open the R Console
2-type library("Rserve") plus Enter
3-type Rserve() plus Enter
4-Open Tableau and go to the Help Menu > Settings and Performance > Manage R Connection > click on test connection (here, make sure that you have server = localhost and a port assigned).
5- After clicking on test connection a message similar to: the connection is successful, should appear on the screen. After you´ve seen this, you are pretty much ready to rock!

Best regards,

Paul