Monday, August 4, 2014

Data Mining in Excel Part 14: Shopping Basket Analysis

Today, we'll be talking about the final Table Analysis Tool and one of most commonly misunderstood topics in all of data mining, Shopping Basket Analysis.
This is a topic we've discussed with multiple clients and analytics vendors.  Every client wants it and every vendor says they can do it.  However, most of the time, the vendors will give you charts showing average number of items in a basket, average value of a basket, most popular items, etc.  But, this is not what the term actually means.

Shopping Basket Analysis, aka Market Basket Analysis, is an algorithm that tells you which items tend to be bought together.  For instance, do diapers and beer tend to be bought together?  This is a funny, old story with questionable origins that has been shown to not be true.  The interesting thing about this whole area is that Analysis Services has a built-in algorithm to do this for you, using data the almost every retailer will have.  Let's look at the data.
 Data
This table is just a list of the items sold and their order numbers.  In fact, we've worked with quite a few different types of retailers and servicers that have data like this already.  Now, let's see the algorithm in action.
 Column Selection
First, we need to tell the algorithm what the ID is for each transaction.  This is usually known as the Order Number or Order ID.  Next, we need to tell the algorithm what the name of the item is.  In most cases, you don't want to use the actual product name.  We are interested in whether cereal and milk are typically bought together.  We aren't interested in whether Kellogg's Raisin Bran is purchased with Pet 2% milk.  Those answers are way too specific to be useful and are almost guaranteed to hurt your analysis.  So, we are using the category instead.  Optionally, you can select a value so that the algorithm can determine how much each grouping is worth.  Finally, there are a few advanced options we can look at.
Here, we can define the Minimum Support and Minimum Probability.  The support is total number of times that the grouping needs to appear in order to be valid.  For instance, if only one customer ever purchased batteries, and that customer also purchased milk, you don't want to say "100% of customers who buy batteries also buy milk."  So, we get to set a minimum support in order to ensure accuracy.  This value depends entirely on the size and distribution of your data set.  For our analysis, we'll use 10 items.

The probability is the number of times the rule is correct within the given support.  For instance, if 100 customers buy cereal, and 10 of those customers also buy ice cream, we may not want to make decisions based off of a 10% probability.  The algorithm has a default probability of 40% that works fine for us.  You can tune it for your needs if you wish.  Let's check out the results.
 Results
We see that Road Bikes and Helmets are bought together quite a bit.  We can also see Jersey and Road Bikes, Bottles and Cages, and many other combinations.  The one downside to this algorithm is that it doesn't exclude the obvious choices.  For instance, most business analysts will know that some items are bought together.  For instance, why would you buy a bottle without a cage to put it in?  So, this algorithm is only a starting point for your analysis.  However, it does offer you statistically valid results that you can use to make real decisions.  As much as we'd love to use Tableau to liven up the algorithm a little.  The results don't lend themselves to easy integration with the original data set.  Therefore, this is as far we go for today.

The great thing about this algorithm is that it doesn't just apply to retailers.  Banks could use this algorithm to see which packages customers tend to buy together.  Network Administrators could use this algorithm to see which issues tend to occur together.  The possibilities are limited only by your imagination.  Keep an eye out for the next post on Classification.  Thanks for reading.  We hope you found this informative.