Monday, September 17, 2018

Data Science in Power BI: Forecasting

Today, we're going to talk about Forecasting within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering and Time Series Decomposition, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Forecasting in Power BI.

Today, we're going to change it up a little and compare two visuals from the Power BI Marketplace.  These visuals are "Forecasting with ARIMA" and "Forecasting TBATS".
Forecasting Visuals
One of the most fascinating aspects of Data Science is how endless the process can be.  There are always many different ways to approach the same problem.  Let's dig a little deeper into the two algorithms we'll be looking at today.

ARIMA stands for "AutoRegressive Integrated Moving Average".  Basically, it's a method for breaking a time series model down into three components.  It's important to note that this type of model is not capable of utilizing multiple variables.  It simply predicts future values of a variable based on previous values of that variable.  Technically, the "Forecasting with ARIMA" model also includes a seasonal component as well.  However, it can only include one continuous model for the trend, denoted by (p,d,q) and one continous model for the season, denoted by (P,D,Q,m).  The results from these models are added together to get the value for each point in time.  You can read more about it here and here.

TBATS stands for "Trigonometric, Box-Cox Transform, ARMA Errors, Trend, Seasonal".  Basically, it's a method for predicting a time series that exhibits more than one seasonal effect.  For instance, retail sales is affected independently by day of the week and month.  Sales may go up in December because of Christmas and may go up further on the weekend because most people are not working.  There's not a ton of information online about this technique, but you can read more it here and here.

Before we can utilize these visuals in Power BI, we may need to install some R packages.  If you've been following along in this series, you'll only need the forecast package.  An earlier post in this series, Clustering, walks through this process in R Studio.  As with the previous posts, we'll be using the Customer Profitability Sample PBIX.  You can download it here if you don't already have it.  Let's create forecasts for [Total COGS] by [Date] using both of these visuals.
ARIMA (Initial)
It looks like the ARIMA visual creates a decent-looking forecast out of the box.  We'll definitely try tweaking this, but let's see what TBATS can do.
TBATS (Initial)
Ouch.  It looks like the default number of predictions out of TBATS is too large.  It's predicting every month out until July 2056.  The ARIMA model only predicted ten months.  Let's match that using the available parameters.
TBATS Forecasting Settings (10 Months)
TBATS (10 Months)

The scale of the graph is much better now, but the forecast is still pretty worthless.  As a side note, we originally wanted to use "Total Revenue" for this analysis.  Alas, we were unable to get any type of useful forecasts using that variable.

As a differentiator from the ARIMA chart, this chart type allows us to explicitly define the seasonality of our data.
TBATS (6 Month Season)
Unfortunately, there was only one combination of seasonalities that worked for this data.  We had to set the seasonal period to 6 months.  This does not bode well for this forecasting algorithm as ease of use and flexibility are key features.  Regardless, this does give us some predictions to work with.  Let's see them together to compare.
Comparing these two, we see that the ARIMA forecast is more stable than the TBATS forecast.  However, it also has a much larger confidence interval.  With the forecasts we have here, it's impossible to determine how "accurate" the forecasts are because we don't know the future values.  In cases like this, it can be very helpful to use a holdout set.  This is when you build your forecasts leaving out the most recent data points so that you can compare the predicted values to the actual values.

This is where it gets tricky.  Our data starts in August 2013 and ends in November 2014.  That's 16 months of data.  This means that we would lose a huge chunk of information if we hold out too much information.  This is where a slightly more advanced technique comes into play, "one step ahead" forecasting.  Basically, we pick any historic point in the time series and remove all of the data points AFTER that point.  Effectively, we are building the time series model as we would have AT THAT POINT IN TIME.  Then, we use that model to predict the next point in time.  This mirrors how time series algorithms are utilized in practice.  To finish the approach, we replicate this technique for every point in time, effectively giving us predictions for every point in time.  Obviously, we can't go back too far, as time series trends change over large time periods and we don't have much data to begin with.  So, let's start by removing the three most recent months.
Three Month Holdout Filter
ARIMA (3M Holdout)
TBATS (3M Holdout)
We definitely have a problem.  The ARIMA model technically works with a three month holdout.  However, it gets reduced to an ARIMA(1,0,0) model which, as you can see, is just a straight line.  To make things worse, the TBATS model doesn't work at all.
ARIMA (1M Holdout)
TBATS (1M Holdout)
As it turns out, we are only capable of without a single month of data to perform our "one step ahead" analysis.  When we do this, our ARIMA model becomes the worthless ARIMA(0,0,0) that always predicts zero.  Our TBATS model is a little more useful.  We can hover over the lines to get the actual and predicted values for November 2014.
Predicted vs Actual
We see that the predicted value is 30% higher than the actual.  Obviously, this model is pretty terrible given everything we have tried so far.  So, what do we now?

We have a few options for how to proceed.  The best idea in this case would be to reduce the granularity of our dataset from months to weeks.  This would give us approximately 4.5x as many data points to work with.  In our case, that's not possible because the data is stored at the month level.  As a next best alternative, we could lean on the time series decomposition.  In our previous post, we explored this chart type.  Since we had to pivot to use [Total COGS] instead of [Total Revenue], here's what the decomposition for [Total COGS] looks like.
While this chart can't make accurate predictions for us, it can provide some basic insight.  For instance, we see that Trend and Seasonality both make up about 40% of the time series structure.  We also see that the Trend was increasing until the middle of the dataset (some time near the end of 2013 and the beginning of 2014) and began decreasing after that.  We can also see that there is a three month seasonal trend.  Since the last data point is in lowest point of the seasonal pattern and the trend seems to have flattened out, we can estimate that [Total COGS] will likely increase slightly for the next two months before falling in the third month.  While this isn't exactly what we were looking for, it is useful information that we may be able to leverage.

Hopefully, this post showcased some of the forecasting and time series analysis techniques available in Power BI.  These techniques require very little knowledge of statistical coding, but still allow us to get some valuable insights from our data.  Stay tuned for the next post where we will discuss Correlations.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions