Monday, October 8, 2018

Data Science in Power BI: Correlations

Today, we're going to talk about Correlations within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition and Forecasting, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Correlations in Power BI.

Correlation is a measure of how strongly two sets of values tend to move together.  When we talk about correlation in a statistical context, we are typically referring to Pearson Correlation, which measures how linear the relationship between two sets of values is.  This value always falls between -1 and 1.  For instance, values that fall perfectly on an "up and to the right" line have a correlation of 1, while values that scatter loosely around that line may have a correlation closer to .5.  The value is negative when the line travels "down and to the right".
Pearson Correlation
In many industries, the ability to determine which values tend to correlate with each other can have tremendous value.  In fact, one of the first steps commonly performed in the data science process is to identify which variables correlate strongly with the variable of interest.  For instance, if we find that Sales is highly correlated with Customer Age, we could utilize Customer Age in a model to predict Sales.
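
To make this concrete, here's a minimal R sketch of a Pearson correlation.  The numbers are made up purely for illustration.

<CODE START>

# Hypothetical data: five customers' ages and their sales
customer_age <- c(25, 34, 42, 51, 63)
sales <- c(120, 180, 240, 290, 360)

# cor() uses the Pearson method by default
cor(customer_age, sales)
# Returns roughly 0.998: a strong, positive linear relationship

<CODE END>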

So, how can we use Power BI to visualize correlations between variables?  Let's see some different ways.  We'll start by making a basic table with one slicer and two measures.
Revenue Current Month and Revenue Previous Month by Customer Name (Table)
This table lets us see the current and previous month's revenue by customer.  While this is good for finding individual customers, it doesn't give us a good idea of how closely related these two measures are.  Scatterplots are usually much better at visualizing this type of information.  Let's switch over to one of those.
Revenue Current Month and Revenue Previous Month by Customer Name (Scatterplot)
We can add a trend line to this graph by using the "Analytics" pane.
Add Trend Line
Revenue Current Month and Revenue Previous Month by Customer Name (Trend)
There's something missing with the way this data is displayed.  It's very difficult to understand our data when it looks like this.  Given the nature of money, it's common for a few large customers to have very large values.  One way to combat this is to change the scale of our data by using the logarithm transformation.  It's important to note that the LN() function in DAX returns an error if it receives a negative or zero value.  This can be remedied by wrapping the expression in the IFERROR() function, e.g. IFERROR( LN( [Revenue Current Month] ), BLANK() ).
Revenue Current Month and Revenue Previous Month by Customer Name (Log)
We can see now that our relationship is much more linear.  It's important to note that Pearson correlation is only applicable to linear relationships.  By looking at this scatterplot, we can guess that our correlation is somewhere between .6 (60%) and .8 (80%).

Now, how would we add another variable to the mix?  Let's try with COGS.
Revenue Current Month and Revenue Previous Month by Customer Name (COGS)
It's not easy to see which scatterplot has the higher correlation.  In addition, this solution required us to create another chart.  While this approach is very useful for determining whether any transformations are necessary (which they were), it doesn't scale well when we want to visualize a large number of variables at once.  Fortunately, the Power BI Marketplace has a solution for this.
Correlation Plot
If you haven't read the previous entries in this series, you can find information on loading Custom R Visuals in this post.  Once we load the Correlation Plot custom visual, we can utilize it pretty simply.
Correlations
We made one tweak to the options to get the coefficients to display, but that's it.  This chart lets us easily scan across a number of variables at once to determine which ones are heavily correlated.  This, combined with the scatterplots we saw earlier, gives us quite a bit of information about our data that could be used to create a great predictive or clustering model.
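
We can't see the visual's internals, but conceptually it's rendering the matrix that R's cor() function returns.  Here's a hedged sketch on simulated data:

<CODE START>

# Simulate three variables: y tracks x closely, z is independent noise
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- df$x + rnorm(100, sd = 0.5)
df$z <- rnorm(100)

# Pairwise Pearson coefficients for every combination of columns
round(cor(df), 2)

<CODE END>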

Hopefully, this post piqued your interest to investigate the options for visualizing correlations within Power BI.  This particular custom visual has a number of different options for changing the visual, as well as grouping variables together based on clusters of correlations.  Very cool!  Stay tuned for the next post where we'll dig into the R integration to create our own custom visuals.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, September 17, 2018

Data Science in Power BI: Forecasting

Today, we're going to talk about Forecasting within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering and Time Series Decomposition, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Forecasting in Power BI.

Today, we're going to change it up a little and compare two visuals from the Power BI Marketplace.  These visuals are "Forecasting with ARIMA" and "Forecasting TBATS".
Forecasting Visuals
One of the most fascinating aspects of Data Science is how endless the process can be.  There are always many different ways to approach the same problem.  Let's dig a little deeper into the two algorithms we'll be looking at today.

ARIMA stands for "AutoRegressive Integrated Moving Average".  Basically, it's a method for modeling a time series using three components: an autoregressive term, a differencing (integration) term and a moving average term.  It's important to note that this type of model is not capable of utilizing multiple variables.  It simply predicts future values of a variable based on previous values of that variable.  Technically, the "Forecasting with ARIMA" visual also includes a seasonal component.  However, it can only include one non-seasonal model, denoted by (p,d,q), and one seasonal model, denoted by (P,D,Q,m).  The results from these models are combined to get the value for each point in time.  You can read more about it here and here.
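
We can't see the visual's exact internals, but the forecast package in R (which these visuals rely on) fits the same family of models.  Here's a minimal sketch using the built-in AirPassengers series:

<CODE START>

library(forecast)

# Automatically search for a good (p,d,q)(P,D,Q)[m] specification
fit <- auto.arima(AirPassengers)
summary(fit)

# Forecast ten months ahead, as the visual did for our data
plot(forecast(fit, h = 10))

<CODE END>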

TBATS stands for "Trigonometric Seasonality, Box-Cox Transformation, ARMA Errors, Trend and Seasonal Components".  Basically, it's a method for predicting a time series that exhibits more than one seasonal effect.  For instance, retail sales are affected independently by day of the week and month of the year.  Sales may go up in December because of Christmas and may go up further on the weekend because most people are not working.  There's not a ton of information online about this technique, but you can read more about it here and here.
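
The forecast package implements TBATS as well.  As a sketch, the taylor dataset that ships with the package exhibits two seasonal effects at once, which is exactly the situation TBATS is designed for:

<CODE START>

library(forecast)

# taylor: half-hourly electricity demand with daily (48) and
# weekly (336) seasonal periods; tbats() models both at once
# (this can take a few minutes on a series this long)
fit <- tbats(taylor)

# Forecast one day (48 half-hours) ahead
plot(forecast(fit, h = 48))

<CODE END>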

Before we can utilize these visuals in Power BI, we may need to install some R packages.  If you've been following along in this series, you'll only need the forecast package.  An earlier post in this series, Clustering, walks through this process in RStudio.  As with the previous posts, we'll be using the Customer Profitability Sample PBIX.  You can download it here if you don't already have it.  Let's create forecasts for [Total COGS] by [Date] using both of these visuals.
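
In case you still need to install the forecast package, it's a one-liner in RStudio:

<CODE START>

# Run once outside Power BI; the R visuals load the package
# from your local R library
install.packages("forecast")

<CODE END>
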
ARIMA (Initial)
It looks like the ARIMA visual creates a decent-looking forecast out of the box.  We'll definitely try tweaking this, but let's see what TBATS can do.
TBATS (Initial)
Ouch.  It looks like the default number of predictions out of TBATS is too large.  It's predicting every month out until July 2056.  The ARIMA model only predicted ten months.  Let's match that using the available parameters.
TBATS Forecasting Settings (10 Months)
TBATS (10 Months)


The scale of the graph is much better now, but the forecast is still pretty worthless.  As a side note, we originally wanted to use "Total Revenue" for this analysis.  Alas, we were unable to get any type of useful forecasts using that variable.

As a differentiator from the ARIMA chart, this chart type allows us to explicitly define the seasonality of our data.
TBATS (6 Month Season)
Unfortunately, there was only one combination of seasonalities that worked for this data: a seasonal period of 6 months.  This does not bode well for the visual, as ease of use and flexibility are supposed to be its key features.  Regardless, this does give us some predictions to work with.  Let's see them together to compare.
Together
Comparing these two, we see that the ARIMA forecast is more stable than the TBATS forecast.  However, it also has a much larger confidence interval.  With the forecasts we have here, it's impossible to determine how "accurate" the forecasts are because we don't know the future values.  In cases like this, it can be very helpful to use a holdout set.  This is when you build your forecasts leaving out the most recent data points so that you can compare the predicted values to the actual values.

This is where it gets tricky.  Our data starts in August 2013 and ends in November 2014.  That's 16 months of data.  This means that we would lose a huge chunk of information if we hold out too much of it.  This is where a slightly more advanced technique comes into play, "one step ahead" forecasting.  Basically, we pick a historical point in the time series and remove all of the data points AFTER that point.  Effectively, we are building the time series model as we would have AT THAT POINT IN TIME.  Then, we use that model to predict the next point in time.  This mirrors how time series algorithms are utilized in practice.  To finish the approach, we replicate this technique for every historical point, effectively giving us a prediction for each point in time.  Obviously, we can't go back too far, as time series trends change over large time periods and we don't have much data to begin with.  So, let's start by removing the three most recent months.
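
As an aside, here's a rough R sketch of the "one step ahead" idea, using the built-in AirPassengers series as a stand-in for our much shorter one:

<CODE START>

library(forecast)

y <- AirPassengers   # stand-in for the monthly [Total COGS] series
n <- length(y)
preds <- numeric(0)

# Refit on the data up to time t, predict t+1, then repeat,
# mirroring how the model would have been used at each point in time
for (t in (n - 12):(n - 1)) {
  fit <- auto.arima(window(y, end = time(y)[t]))
  preds <- c(preds, forecast(fit, h = 1)$mean)
}

# Compare each prediction to the actual value it tried to anticipate
actuals <- as.numeric(y[(n - 11):n])
mean(abs(preds - actuals) / actuals)   # mean absolute percentage error

<CODE END>

Now, back to Power BI.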
Three Month Holdout Filter
ARIMA (3M Holdout)
TBATS (3M Holdout)
We definitely have a problem.  The ARIMA model technically works with a three month holdout.  However, it gets reduced to an ARIMA(1,0,0) model which, as you can see, is just a straight line.  To make things worse, the TBATS model doesn't work at all.
ARIMA (1M Holdout)
TBATS (1M Holdout)
As it turns out, we are only capable of holding out a single month of data to perform our "one step ahead" analysis.  When we do this, our ARIMA model becomes the worthless ARIMA(0,0,0) that always predicts zero.  Our TBATS model is a little more useful.  We can hover over the lines to get the actual and predicted values for November 2014.
Predicted vs Actual
We see that the predicted value is 30% higher than the actual.  Obviously, this model is pretty terrible given everything we have tried so far.  So, what do we do now?

We have a few options for how to proceed.  The best idea in this case would be to reduce the granularity of our dataset from months to weeks.  This would give us roughly 4.3 times as many data points to work with.  In our case, that's not possible because the data is stored at the month level.  As a next best alternative, we could lean on the time series decomposition.  In our previous post, we explored this chart type.  Since we had to pivot to use [Total COGS] instead of [Total Revenue], here's what the decomposition for [Total COGS] looks like.
Decomposition
While this chart can't make accurate predictions for us, it can provide some basic insight.  For instance, we see that Trend and Seasonality both make up about 40% of the time series structure.  We also see that the Trend was increasing until the middle of the dataset (some time near the end of 2013 and the beginning of 2014) and began decreasing after that.  We can also see that there is a three-month seasonal pattern.  Since the last data point is at the lowest point of the seasonal pattern and the trend seems to have flattened out, we can estimate that [Total COGS] will likely increase slightly for the next two months before falling in the third month.  While this isn't exactly what we were looking for, it is useful information that we may be able to leverage.

Hopefully, this post showcased some of the forecasting and time series analysis techniques available in Power BI.  These techniques require very little knowledge of statistical coding, but still allow us to get some valuable insights from our data.  Stay tuned for the next post where we will discuss Correlations.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, August 27, 2018

Data Science in Power BI: Time Series Decomposition

Today, we're going to talk about Time Series Decomposition within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts and Clustering, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Time Series Decomposition in Power BI.

FYI: The Power BI August Newsletter just announced Python compatibility.  We're looking forward to digging into that in future posts.  You can find the newsletter here.

First, let's talk about what time series data is.  Simply put, anything that can be measured at individual points in time can be a time series.  For instance, many organizations record their revenue on a daily basis.  If we plot this revenue as a line across time, we have time series data.  Often, time series data is measured at regular intervals.  Weekly measurements are one example of this.  However, there are many cases where irregular time intervals are used.  For instance, calendar months are not all equal in size.  Therefore, a time series of this data would have irregular time intervals.  This isn't necessarily a bad thing, but it should be considered when doing important analyses.  You can read more about time series data here.
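
For illustration, here's how a small monthly series looks as an R time series object.  The values are made up:

<CODE START>

# Six months of hypothetical revenue, starting August 2013,
# at a regular monthly interval (frequency = 12)
revenue <- ts(c(120, 135, 128, 150, 162, 170),
              start = c(2013, 8), frequency = 12)
plot(revenue)

<CODE END>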

Now, let's talk about Time Series Decomposition.  Time Series Decomposition is the process of taking time series data and separating it into multiple underlying components.  In our case, we'll be breaking our time series into Trend, Seasonal and Random components.  The Trend component is useful for telling us whether our measurements are going up or down over time.  The Seasonal component is useful for telling us how heavily our measurements are affected by regular intervals of time.  For instance, retail data often has heavy yearly seasonality because people buy particular items at particular times of year, especially during the holidays.  Finally, the Random component is what's left over when we remove the Trend and Seasonal components.  You can read more about this technique here and here.
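
The custom visual we'll use runs its own R script, but base R can perform the same kind of split in a couple of lines.  A minimal sketch using the built-in AirPassengers series:

<CODE START>

# Split the series into trend, seasonal and random components
# (decompose() assumes an additive structure by default)
parts <- decompose(AirPassengers)

# Four stacked panels: observed, trend, seasonal, random
plot(parts)

<CODE END>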

Let's hop into Power BI and make a quick time series chart.  We'll be using the same Customer Profitability Sample PBIX from the previous posts.  You can download it here.  If you haven't read Getting Started with R Scripts, it's recommended that you do so now.  Let's start by making a simple line chart of Total Revenue by Month.
Total Revenue by Year
Looking at this data, it seems that our Total Revenue is increasing over time.  However, it's difficult to know how strongly or if there is any seasonality.  This is where the Time Series Decomposition chart can help us.  Just as we did in the previous post, Clustering, we'll download this from the Power BI Marketplace.
Import From Marketplace
Power BI Visuals
Time Series Decomposition Chart
Time Series Decomposition Chart Description
Looking at the description of the Time Series Decomposition chart, we see that it requires the proto and zoo packages in R.  This will be important later in this post.  Let's scroll up and add this chart to our PBIX.
Add Time Series Decomposition Chart
Now, let's change our line chart to a time series decomposition chart by selecting the icon in the Visualizations pane.
Change to Time Series Decomposition Chart
Enable Script Visuals
Since we just opened this report, we need to enable R Script Visuals.

If you get this error, you need to install the zoo and proto R packages.  The previous post walks through this process.  You may need to save and reopen the PBIX after installing the packages to see the chart.
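
In RStudio, that's a single line:

<CODE START>

# Run once in RStudio, then save and reopen the PBIX
install.packages(c("zoo", "proto"))

<CODE END>
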
Time Series Decomposition of Total Revenue by Month
This chart is extremely interesting.  In order, these lines represent the actual data, followed by the seasonal, trend and remainder (also known as random) components.  However, this is too small to easily read.  Fortunately, the makers of this chart give us the option to look at one piece at a time.  Let's take a look at the trend first.
View Trend
Trend
This shows us the trend of our time series in red, compared to the actual data in light grey.  Now we have much more evidence to say that our trend was increasing over time.  However, we also see that the algorithm recorded a decreasing trend recently.  It also didn't seem to pick up on the recent spike at all.  We'll look into this later in this post.  For now, let's take a look at the seasonality.
Seasonality
Looking at this, it appears our revenue spikes every third month (March, June, September, December).  Unfortunately, the lack of detail on the horizontal axis makes this slightly frustrating to investigate.  Perhaps in a later post we'll crack open this R code and make some adjustments.  Let's move on to the remainder.
Remainder
It looks like the algorithm was able to accurately fit the earlier values in the series.  However, the more recent values are showing some troubling variation.  This is apparent from looking at the actual data as well.  It's possible that our recent revenue values are not following the same pattern as last year.  This may be an opportunity to utilize an algorithm that doesn't consider time as a factor, such as Regression.
Clean
Interestingly, there's an option to display the "Clean" data, i.e. Trend and Seasonality without the Remainder.  This can be a handy way to see how well the decomposition fits the data.  As we suspected, the recent months are causing an issue with the algorithm.

We could spend all day looking at all the data available here.  Instead, let's end by looking at one final aspect of this chart, Algorithm Parameters.
Algorithm Parameters
We have the option of tinkering with the parameters a little to make our algorithm better.
Clean (Degree)
By setting the Degree parameter to "On", we can sacrifice accuracy elsewhere in the series to be able to account for the recent spike.  Is this the "right" answer?  That's a much more complicated question.  Let us know your thoughts in the comments.

Hopefully, this post opened your eyes just a little to the possibilities of performing time series analysis within Power BI.  The custom visuals in the marketplace provide a strong "middle ground" offering that makes advanced analyses possible outside of hardcore coding tools like R and Python.  Stay tuned for the next post where we'll be talking about Forecasting.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, August 6, 2018

Data Science in Power BI: Clustering

Today, we're going to talk about Clustering within Power BI.  If you haven't read the previous posts in this series, Introduction and Getting Started with R Scripts, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Clustering in Power BI.

Clustering is a statistical process by which we group records or observations into categories based on how similar they are to one another.
Clustering
This is very easy to visualize in two dimensions.  We can see that clustering can be used to determine that there are three groups of observations, known as clusters, in our data.  As we add more dimensions, it becomes difficult, or even impossible, to visualize.  This is where clustering algorithms can help.  You can read more about Clustering here.
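
As a quick illustration, here's a hedged R sketch that simulates three groups of points and asks the K-Means algorithm (which we'll meet again later in this post) to find them:

<CODE START>

# Simulate three well-separated groups of points in two dimensions
set.seed(42)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))

# Ask K-Means for three clusters and color the points by assignment
km <- kmeans(pts, centers = 3)
plot(pts, col = km$cluster)

<CODE END>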

Fortunately, the Power BI Marketplace has a pre-built "Clustering" visual that uses R scripts on the back-end.  Let's check it out!

We start by opening the "Customer Profitability Sample PBIX" we downloaded in the previous post.  In case you haven't read that post, you can find the file here.
Import From Marketplace
Then, we navigate to the "Visualizations" pane and select "... -> Import From Marketplace".
Power BI Marketplace
This takes us to the "Power BI Marketplace".  Here, we can look at all of the custom visuals created by Microsoft and members of the community.
Power BI Marketplace (Cluster)
If we search for "cluster", we can add the "Clustering" custom visual by clicking the "Add" button.  This makes the custom visual available as an option in our Power BI report.
Visualizations
We can see that the "Clustering" custom visual has been added to the list.  Let's make a new page in the report and add an empty "Clustering" chart.
Enable Script Visuals
We were prompted to enable script visuals.  This is necessary when we utilize R scripts or custom visuals that utilize them.  If you get some type of error here, you may need to read the previous post to ensure your R environment in Power BI is correctly configured.
Can't Display This Visual
During this process, we also stumbled across this error.  The "Clustering" chart tries to load a number of R packages.  However, these R packages need to be installed on the machine in order for this to work.  Fortunately, it's pretty simple to install packages using RStudio.
Install Packages
We kept installing packages until we completed the list.  There is one package, Redmonder, that we could not install through RStudio.  Instead, we had to download the package from here and manually copy it to the R package directory.  In our case, this is
C:\Users\<Username>\Documents\R\win-library\3.3\
Once we completed that, we faced our next challenge.  This custom visual does not allow us to use measures.  This means that basic questions like "How many clusters of customers exist based on Total Revenue and Total Labor Costs?" become more complex to solve.  Fortunately, it's not too difficult to turn the measures into calculated columns.  We do this by creating a new table in Power BI.
New Table
We start by selecting the "New Table" button in the "Modeling" tab at the top of the screen.
Customer Summary
Then, we use the following code to create a table called "CustomerSummary".  This table contains one record for each Customer Name, along with the Total Revenue and Total Labor Costs associated with that Customer.

<CODE START>

CustomerSummary = 
// One row per customer, with total revenue and labor costs as columns
SUMMARIZE(
    'Fact'
    ,'Customer'[Name]
    ,"Total Revenue", SUM( 'Fact'[Revenue] )
    ,"Total Labor Costs", SUM( 'Fact'[Labor Costs Variable] )
)

<CODE END>

You can read more about the SUMMARIZE() function here if you are interested.
Total Revenue and Total Labor Costs by Customer Name
Finally, we can create our Clustering chart by adding [Total Revenue] and [Total Labor Costs] to the "Values" shelf and [Name] to the "Data Point Labels" shelf.  The resulting chart is less than inspiring.  "Cluster 2" at the top-right of the chart contains two outlying points, leaving "Cluster 1" to contain the rest of the dataset.  The issue here is that K-Means Clustering is not robust against outliers.  You can read more about K-Means Clustering here.

In order to make a more interesting set of clusters, we need a way to reduce the influence of these outliers.  One approach is to rescale the data so that extremely large observations aren't quite so large.  Two common functions to handle this are logarithm and square root.  Let's alter our code to take the square root of [Total Revenue] and [Total Labor Costs].

<CODE START>

CustomerSummary = 
// Square roots rein in the outliers before clustering
SUMMARIZE(
    'Fact'
    ,'Customer'[Name]
    ,"Total Revenue (Square Root)", SQRT( SUM( 'Fact'[Revenue] ) )
    ,"Total Labor Costs (Square Root)", SQRT( SUM( 'Fact'[Labor Costs Variable] ) )
)

<CODE END>

Since we altered the field names, we also need to recreate our chart to contain the [Total Labor Costs (Square Root)] and [Total Revenue (Square Root)] fields in the "Values" shelf.
Total Revenue (Square Root) and Total Labor Costs (Square Root) by Customer Name
We can see that we now have a more interesting set of clusters.  The outliers are still present, but not nearly as dominant.  This custom visual has quite a few more interesting options.  We encourage you to play around with it to create some visuals that would not be possible in standard Power BI.

Hopefully, this post opened your eyes to the Data Science possibilities in Power BI.  The power of custom R scripts has created a new world of potential, just waiting to be unleashed.  Stay tuned for the next post where we'll be talking about Time Series Decomposition.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com