Monday, August 27, 2018

Data Science in Power BI: Time Series Decomposition

Today, we're going to talk about Time Series Decomposition within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts and Clustering, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Time Series Decomposition in Power BI.

FYI: The Power BI August Newsletter just announced Python compatibility.  We're looking forward to digging into that in future posts.  You can find the newsletter here.

First, let's talk about what time series data is.  Simply put, anything that can be measured at individual points in time can be a time series.  For instance, many organizations record their revenue on a daily basis.  If we plot this revenue as a line across time, we have time series data.  Often, time series data is measured at regular intervals.  Weekly measurements are one example of this.  However, there are many cases where irregular time intervals are used.  For instance, calendar months are not all equal in size.  Therefore, a time series of this data would have irregular time intervals.  This isn't necessarily a bad thing, but it should be considered when doing important analyses.  You can read more about time series data here.
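To make the point about irregular intervals concrete, here is a minimal Python sketch (Python rather than R, purely for illustration). The function names and the revenue figures are hypothetical and do not come from the sample PBIX. It shows that calendar months differ in length and that dividing a monthly total by the number of days is one simple way to put months on a comparable footing.

```python
import calendar


def days_in_each_month(year):
    """Return the number of days in each calendar month of `year`."""
    return [calendar.monthrange(year, month)[1] for month in range(1, 13)]


def revenue_per_day(monthly_revenue, year):
    """Divide each month's total by that month's length in days."""
    lengths = days_in_each_month(year)
    return [total / days for total, days in zip(monthly_revenue, lengths)]


# Made-up example: the same total every month still gives different daily rates.
flat_totals = [310.0] * 12
print(days_in_each_month(2018))
print(revenue_per_day(flat_totals, 2018)[:3])
```

Whether this kind of adjustment is appropriate depends on the analysis, so treat it as one option rather than a required step.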

Now, let's talk about Time Series Decomposition.  Time Series Decomposition is the process of taking time series data and separating it into multiple underlying components.  In our case, we'll be breaking our time series into Trend, Seasonal and Random components.  The Trend component is useful for telling us whether our measurements are going up or down over time.  The Seasonal component is useful for telling us how heavily our measurements are affected by regular intervals of time.  For instance, retail data often has heavy yearly seasonality because people buy particular items at particular times of year, especially during the holidays.  Finally, the Random component is what's left over when we remove the Trend and Seasonal components.  You can read more about this technique here and here.
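The idea of splitting a series into Trend, Seasonal and Random components can be sketched in a few lines of Python. This is a classical additive decomposition using a centered moving average. It is an illustrative assumption, not necessarily the method the custom visual uses internally, and the series below is made up. The sketch assumes the series covers at least two full seasonal periods.

```python
def centered_moving_average(values, period):
    """Trend estimate: centered moving average; None where the window does not fit."""
    n = len(values)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = values[i - half:i + half + 1]
        if period % 2 == 1:
            trend[i] = sum(window) / period
        else:
            # Even period: the two end points of the window get half weight each.
            trend[i] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
    return trend


def decompose_additive(values, period):
    """Split `values` into trend, seasonal and remainder so that they add up."""
    trend = centered_moving_average(values, period)
    # Seasonal: average detrended value at each position within the cycle.
    buckets = [[] for _ in range(period)]
    for i, (value, t) in enumerate(zip(values, trend)):
        if t is not None:
            buckets[i % period].append(value - t)
    means = [sum(bucket) / len(bucket) for bucket in buckets]
    centre = sum(means) / period  # re-centre so the seasonal effects sum to zero
    seasonal = [means[i % period] - centre for i in range(len(values))]
    # Remainder: whatever is left after removing trend and seasonal.
    remainder = [
        None if t is None else value - t - s
        for value, t, s in zip(values, trend, seasonal)
    ]
    return trend, seasonal, remainder


# Made-up series: a steady upward trend plus a spike every third point.
pattern = [-3, -3, 6]
series = [100 + 2 * i + pattern[i % 3] for i in range(12)]
trend, seasonal, remainder = decompose_additive(series, period=3)
print(trend[1:4])
print(seasonal[:3])
```

Because the made-up series is an exact trend plus an exact repeating pattern, the remainder here is essentially zero. Real data will leave a non-trivial remainder.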

Let's hop into Power BI and make a quick time series chart.  We'll be using the same Customer Profitability Sample PBIX from the previous posts.  You can download it here.  If you haven't read Getting Started with R Scripts, it's recommended that you do so now.  Let's start by making a simple line chart of Total Revenue by Month.
Total Revenue by Year
Looking at this data, it seems that our Total Revenue is increasing over time.  However, it's difficult to tell how strong that increase is, or whether there is any seasonality.  This is where the Time Series Decomposition chart can help us.  Just as we did in the previous post, Clustering, we'll download this from the Power BI Marketplace.
Import From Marketplace
Power BI Visuals
Time Series Decomposition Chart
Time Series Decomposition Chart Description
Looking at the description of the Time Series Decomposition chart, we see that it requires the proto and zoo packages in R.  This will be important later in this post.  Let's scroll up and add this chart to our PBIX.
Add Time Series Decomposition Chart
Now, let's change our line chart to a time series decomposition chart by selecting the icon in the Visualizations pane.
Change to Time Series Decomposition Chart
Enable Script Visuals
Since we just opened this report, we need to enable R Script Visuals.

If you get this error, you need to install the zoo and proto R packages.  The previous post walks through this process.  You may need to save and reopen the PBIX after installing the packages to see the chart.
Time Series Decomposition of Total Revenue by Month
This chart is extremely interesting.  In order, these lines show the actual data, followed by the seasonal, trend and remainder (also known as random) components.  However, this is too small to easily read.  Fortunately, the makers of this chart give us the option to look at one piece at a time.  Let's take a look at the trend first.
View Trend
This shows us the trend of our time series in red, compared to the actual data in light grey.  Now we have much more evidence to say that our trend was increasing over time.  However, we also see that the algorithm recorded a decreasing trend recently.  It also didn't seem to pick up on the recent spike at all.  We'll look into this later in this post.  For now, let's take a look at the seasonality.
Looking at this, it appears our revenue spikes every third month (March, June, September, December).  Unfortunately, the lack of detail on the horizontal axis makes this slightly frustrating to investigate.  Perhaps in a later post we'll crack open this R code and make some adjustments.  Let's move on to the remainder.
Looking at the remainder, it appears the algorithm was able to fit the earlier values in the series accurately.  However, the more recent values in the series are showing some troubling variation.  This is apparent by looking at the actual data as well.  It's possible that our recent revenue values are not following the same pattern as last year.  This may be an opportunity to utilize an algorithm that doesn't consider time as a factor, such as Regression.
Interestingly, there's an option to display the "Clean" data, i.e. Trend and Seasonality without the Remainder.  This could be an interesting way to see how well the decomposition fits the data.  As we suspected, the recent months are causing an issue with the algorithm.
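The relationship between the "Clean" series and the remainder can be sketched as follows. This is a minimal Python illustration with made-up component values, assuming an additive decomposition. The function names are hypothetical and are not part of the custom visual.

```python
def clean_series(trend, seasonal):
    """The "Clean" series: trend plus seasonal, with the remainder left out."""
    return [t + s for t, s in zip(trend, seasonal)]


def absolute_remainders(actual, clean):
    """Size of the gap between the actual data and the clean series at each point."""
    return [abs(a - c) for a, c in zip(actual, clean)]


# Made-up components; the last actual value contains a spike the fit misses.
trend = [10.0, 12.0, 14.0, 16.0]
seasonal = [1.0, -1.0, 1.0, -1.0]
actual = [11.0, 11.0, 15.0, 20.0]

clean = clean_series(trend, seasonal)
gaps = absolute_remainders(actual, clean)
print(clean)
print(gaps)
print(sum(gaps) / len(gaps))  # mean absolute remainder as a rough measure of fit
```

Looking at the gaps point by point, rather than only at the average, is what reveals that the poor fit is concentrated in the most recent values.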

We could spend all day looking at all the data available here.  Instead, let's end by looking at one final aspect of this chart, Algorithm Parameters.
Algorithm Parameters
We have the option of tinkering with the parameters a little to make our algorithm better.
Clean (Degree)
By setting the Degree parameter to "On", we can sacrifice accuracy elsewhere in the series to be able to account for the recent spike.  Is this the "right" answer?  That's a much more complicated question.  Let us know your thoughts in the comments.

Hopefully, this post opened your eyes just a little to the possibility of performing time series analysis within Power BI.  The custom visuals in the marketplace provide a strong "middle ground" offering that makes advanced analyses possible outside of hardcore coding tools like R and Python.  Stay tuned for the next post where we'll be talking about Forecasting.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions

Monday, August 6, 2018

Data Science in Power BI: Clustering

Today, we're going to talk about Clustering within Power BI.  If you haven't read the previous posts in this series, Introduction and Getting Started with R Scripts, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Clustering in Power BI.

Clustering is a statistical process by which we group records or observations into categories based on how similar they are to one another.
This is very easy to visualize in two dimensions.  We can see that clustering can be used to determine that there are three groups of observations, known as clusters, in our data.  As we add more dimensions, it becomes difficult, or even impossible, to visualize.  This is where clustering algorithms can help.  You can read more about Clustering here.
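For readers curious about what a clustering algorithm does under the hood, here is a minimal Python sketch of K-Means (Lloyd's algorithm) on two-dimensional points. The data and the starting centroids are made up, and the starting centroids are fixed by hand so the result is repeatable. Real implementations, including whatever the custom visual uses, typically choose starting points differently.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Assign each point to its nearest centroid, move centroids, and repeat."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Made-up data: three visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (2, 9), (1, 8)]
labels, centroids = kmeans(points, centroids=[(1, 1), (8, 8), (1, 9)])
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The same function works for points with more than two dimensions, which is where clustering becomes more useful than eyeballing a chart.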

Fortunately, the Power BI Marketplace has a pre-built "Clustering" visual that uses R scripts on the back-end.  Let's check it out!

We start by opening the "Customer Profitability Sample PBIX" we downloaded in the previous post.  In case you haven't read that post, you can find the file here.
Import From Marketplace
Then, we navigate to the "Visualizations" pane and select "... -> Import From Marketplace".
Power BI Marketplace
This takes us to the "Power BI Marketplace".  Here, we can look at all of the custom visuals created by Microsoft and members of the community.
Power BI Marketplace (Cluster)
If we search for "cluster", we can add the "Clustering" custom visual by clicking the "Add" button.  This will add this custom visual as an option in our Power BI Report.
We can see that the "Clustering" custom visual has been added to the list.  Let's make a new page in the report and add an empty "Clustering" chart.
Enable Script Visuals
We were prompted to enable script visuals.  This is necessary when we utilize R scripts or custom visuals that utilize them.  If you get some type of error here, you may need to read the previous post to ensure your R environment in Power BI is correctly configured.
Can't Display This Visual
During this process, we also stumbled across this error.  The "Clustering" chart tries to load a number of R packages.  However, these R packages need to be installed on the machine in order for this to work.  Fortunately, it's pretty simple to install packages using RStudio.
Install Packages
We kept installing packages until we completed the list.  There is one package, Redmonder, that we could not install through RStudio.  Instead, we had to download the package from here and manually copy it to the R package directory.  In our case, this is
 Once we completed that, we faced our next challenge.  This custom visual does not allow us to use measures.  This means that basic questions like "How many clusters of customers exist based on Total Revenue and Total Labor Costs?" become more complex to solve.  Fortunately, it's not too difficult to turn the measures into calculated columns.  We do this by creating a new table in Power BI.
New Table
We start by selecting the "New Table" button in the "Modeling" tab at the top of the screen.
Customer Summary
Then, we use the following code to create a table "CustomerSummary".  This table contains one record for each Customer Name, along with the Total Revenue and Total Labor Costs associated to that Customer.


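CustomerSummary = 
SUMMARIZE(
'Fact', 'Customer'[Name]  // assumed grouping column; this part of the listing was lost
,"Total Revenue", SUM( 'Fact'[Revenue] )
,"Total Labor Costs", SUM( 'Fact'[Labor Costs Variable] )
)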


You can read more about the SUMMARIZE() function here if you are interested.
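Conceptually, the CustomerSummary table is a group-by: one output row per customer name, with the revenue and labor cost columns summed. A minimal Python sketch of that aggregation looks like this. The rows are made up, and the dictionary keys simply mirror the column names used in this post.

```python
from collections import defaultdict


def summarize_by_customer(fact_rows):
    """Return one summary per customer name with summed revenue and labor costs."""
    totals = defaultdict(lambda: {"Total Revenue": 0.0, "Total Labor Costs": 0.0})
    for row in fact_rows:
        summary = totals[row["Name"]]
        summary["Total Revenue"] += row["Revenue"]
        summary["Total Labor Costs"] += row["Labor Costs Variable"]
    return dict(totals)


# Made-up fact rows: several records per customer collapse into one summary each.
fact_rows = [
    {"Name": "Customer A", "Revenue": 100.0, "Labor Costs Variable": 40.0},
    {"Name": "Customer B", "Revenue": 250.0, "Labor Costs Variable": 90.0},
    {"Name": "Customer A", "Revenue": 50.0, "Labor Costs Variable": 10.0},
]
print(summarize_by_customer(fact_rows))
```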
Total Revenue and Total Labor Costs by Customer Name
Finally, we can create our Clustering chart by adding [Total Revenue] and [Total Labor Costs] to the "Values" shelf and [Name] to the "Data Point Labels" shelf.  The resulting chart is less than inspiring.  "Cluster 2" at the top-right of the chart contains two outlying points, leaving "Cluster 1" to contain the rest of the dataset.  The issue here is that K-Means Clustering is not robust against outliers.  You can read more about K-Means Clustering here.

In order to make a more interesting set of clusters, we need a way to reduce the influence of these outliers.  One approach is to rescale the data so that extremely large observations aren't quite so large.  Two common functions to handle this are logarithm and square root.  Let's alter our code to take the square root of [Total Revenue] and [Total Labor Costs].
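To see why a square root helps, here is a small Python sketch with made-up revenue values. The helper `spread_ratio` is a hypothetical measure invented for this illustration: it divides the largest value by the median to show how far the biggest observation sits from a typical one.

```python
import math


def spread_ratio(values):
    """Largest value divided by the median: a rough gauge of outlier dominance."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    return max(values) / median


# Made-up revenue figures with one extreme customer.
revenue = [100, 400, 900, 1600, 1000000]
rescaled = [math.sqrt(v) for v in revenue]

print(rescaled)                # [10.0, 20.0, 30.0, 40.0, 1000.0]
print(spread_ratio(revenue))   # the outlier is over a thousand times the median
print(spread_ratio(rescaled))  # after rescaling it is only a few dozen times
```

The ordering of the customers is unchanged. Only the distances between them shrink, which is what gives a distance-based algorithm like K-Means a better chance of separating the remaining points.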


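CustomerSummary = 
SUMMARIZE(
'Fact', 'Customer'[Name]  // assumed grouping column; this part of the listing was lost
,"Total Revenue (Square Root)", SQRT( SUM( 'Fact'[Revenue] ) )
,"Total Labor Costs (Square Root)", SQRT( SUM( 'Fact'[Labor Costs Variable] ) )
)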


Since we altered the field names, we also need to recreate our chart to contain the [Total Labor Costs (Square Root)] and [Total Revenue (Square Root)] fields in the "Values" shelf.
Total Revenue (Square Root) and Total Labor Costs (Square Root) by Customer Name
We can see that we now have a more interesting set of clusters.  The outliers are still present, but not nearly as dominant.  This custom visual has quite a few more interesting options.  We encourage you to play around with it to create some visuals that would not be possible in standard Power BI.

Hopefully, this post opened your eyes to the Data Science possibilities in Power BI.  The power of custom R scripts has created a new world of potential, just waiting to be unleashed.  Stay tuned for the next post where we'll be talking about Time Series Decomposition.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions