Monday, September 15, 2014

Data Mining in Excel Part 20: Association

Today, we'll be talking about one of the more complex data mining algorithms, Association.
The association algorithm is commonly known as Market Basket Analysis or Shopping Basket Analysis.  It attempts to discern which events tend to happen together.  For example, "Which items do customers tend to buy together?" or "Which websites do people tend to surf to in the same visit?".  If you have groups of anything, this algorithm can help you determine which items are "associated".  Let's check it out.
Select Data Source
The first step is always to select your data source.  Interestingly enough, you can't connect to an external SQL Source using this algorithm.  In a later post, we'll talk about ways to work around this.  Let's move on.
Transaction and Item Identification
Here, we need to tell the algorithm which fields correspond to Transaction ID and Item.  The transaction is the simple one, usually an Order ID or a Visit Number.  The item is a bit more interested.  For a retailer, you'd think that an item would be an individual product.  If you were to use Product for Item, you would get a bunch of silly rules saying "People who buy Pet 2% milk tend to buy Hershey's Cookies and Cream Cereal Puffs".  That rule is way too specific to be of any use.  A more reasonable rule would be "People who buy Milk tend to buy Cereal".  This is why we use Category instead to get more useful rules.

Next, we need to tell the algorithm what the minimum support and minimum rule probabilities need to be.  The support of a rule is the number of times the item appeared.  The probability is the percentage of times that the items appeared together relative to the number of times they appear separately.  Let's say you have a rule that says "People who buy Milk tend to buy Cereal".  The support would be the number of times that people bought Milk.  The percentage would be the percentage of those orders that also had Cereal.  If 10 people bought Milk and 5 of those also bought cereal, then the support would 10 and the probability would be 5 / 10 = .50 = 50%.  We're fine with using the default values here.  Let's check out the parameters.
Here, we have a bunch of values similar to the support and percentage we talked about earlier.  We're okay with keeping these defaults for now.  For more information, read this.  Moving on.
Create Model
Finally, we need to create our mining structure and model.  Once we do this, we can browse the model.
Rules Browser
This is the Rules Browser.  It let's you see all of the rules that the algorithm built from your data, as well as their probability and importance.  We already talked about what probability is.  Importance is new for us though.  Importance, aka Lift, is a type of probability that tells us how, for lack of a better word, "important" the rule is.  For instance, if every customer buys Milk and Eggs, then a rule saying "People who buy Milk also buy Eggs" isn't very useful.  You could put Eggs and Milk on completely different sides of the store and it wouldn't matter because everyone is going to buy them anyway.  However, if only 10% of your customers buy Milk, but they all buy Eggs as well, then that rule becomes much more useful.  Therefore, importance is typically what you want to look for.  Let's zoom in on a couple areas of this browser.
Probability and Importance Filters
These two filters will allow you to filter out rules that are too uncommon or unimportant for your analysis.
Rule and Display Filters
Here, you can filter for rules that contain certain keywords and choose what you would like have to displayed.
Finally, we can see some rules.  Let's check out the first rule.  It has a pretty high probability of .7 and a high importance of .855.  The rule says "If the customer buys Socks and Mountain Bikes, they are also likely to buy Fenders."  The second rules says "If the customer buys Vests and Mountain Bikes, they are also likely to buy Fenders."  These rules can provide extraordinary insight into your business if you approach them with some business knowledge in hand.  A good business analyst will look at some of these rules and say "Duh!  I already knew that".  The real value here lies in the rules that make the analyst stop and say "Hmm.  I didn't know that.  Let me dig into this a little deeper."  Let's look at the Itemset Browser.
Itemsets Browser
This view shows you something a little different.  It shows you which items are more commonly bought together.  This is very useful if you want to know which combinations of items are selling, without having worry about things like importance.  Let's check out the Dependency Network.
Dependency Network
We've seen a couple of these before.  So, we won't spend too much time talking about it.  But this does give us an idea about how different products relate with one another.  For instance, there are 5 different categories that predict the purchase of Tires and Tubes.  In reality, the Itemsets Browser showed us that Tires and Tubes is by far the most popular category.  Therefore, this isn't much of a surprise.  However, did you know that people who buy Gloves or Touring Bikes tend to also buy Helmets?  These are the kinds of insights you find if you put for the effort to find them.

The association algorithm is actually one of the most complex algorithms commonly used today.  The good thing is that Excel makes it pretty easy to work with.  Stay tuned for our next post where we'll be talking about Forecasting.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

No comments:

Post a Comment