Monday, September 1, 2014

Data Mining in Excel Part 18: Clustering

Today, we're going to talk about the next in the set of Data Mining algorithms, Clustering.
Clustering is our all-time favorite statistical procedure.  It requires no prior knowledge of the data and gives great insight into what is actually in the data.  To our knowledge, it's one of the few statistical procedures that doesn't require you to ask a question.  For instance, regression algorithms require you to ask "What would my profit be if I ordered this new product?".  With a clustering algorithm, you simply throw data at it and the algorithm tells you what's important and what's not.  Let's see it in action.  As usual, we will be using the Data Mining Sample data set from Microsoft.
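Before we open Excel, here is a minimal sketch of what "throwing data at a clustering algorithm" can look like.  It uses plain k-means in Python, which is one common clustering technique and not necessarily what the Microsoft algorithm does internally.  The (age, cars) rows and the kmeans function are made up for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters with plain k-means (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen rows
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, clusters


# Hypothetical (age, number of cars) rows; no question is asked of the data.
rows = [(22, 0), (25, 0), (24, 1), (51, 2), (55, 2), (49, 3)]
centroids, clusters = kmeans(rows, k=2)
for members in clusters:
    print(sorted(members))
```

Notice that we never told the sketch which rows belong together.  The grouping comes entirely from the data.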
Select Data Source
Of course, the first step is to select our data source.  We could use an external SQL source if we wished, but we'll go ahead and use the table.
Column Selection
The first thing we need to choose is the number of clusters, or segments, that we want.  Typically, it's not a good idea to specify the number of clusters.  The algorithm has its own criteria for choosing this value for us.  So, we'll let the algorithm do the hard work.  Also, we never want to use IDs in our analyses, since an ID is just a unique label for each row and says nothing about how rows resemble each other.  Let's check out what kinds of parameters we have available.
The clustering algorithm has more parameters than the decision tree algorithm.  The most important one to notice is the Clustering Method parameter.  Changing this value will likely have a serious impact on how your clusters are designed.  Perhaps a little foreshadowing here, but we may see that parameter again.  For more information on these parameters, look here.  Let's move on.
Training Set Selection
All of these algorithms require us to set aside a portion of the data for testing purposes.  We can keep the default of 30%.
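For readers who want to see what a holdout amounts to, here is a minimal Python sketch, assuming a simple random split.  The split_holdout function, the seed, and the 1,000 placeholder rows are hypothetical, and the add-in may choose its test rows differently.

```python
import random


def split_holdout(rows, test_fraction=0.30, seed=42):
    """Randomly set aside test_fraction of the rows for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows are held out; the rest train the model.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(1000))        # stand-in for 1,000 table rows
train, test = split_holdout(rows)
print(len(train), len(test))
```

In this sketch, a test_fraction of 0 keeps every row for training, while a test_fraction of 1 leaves nothing to train on.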
Create Model
Finally, we need to create a structure and model to house the results.  Let's get to the analysis!
Cluster Diagram
The cluster diagram shows the relationships between the clusters.  The more similar two clusters are, the "stronger" the link between them will be.  The stronger the link, the darker the line.  You can show more or fewer links by using the slider on the left side of the window (enclosed in red below).  You can also examine the clusters by how dark they appear.  The shading of the clusters is determined by the box at the top of the screen (enclosed in brown below).
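One way to picture link strength is as closeness between cluster centers.  The Python sketch below is only an assumption about how such a diagram could be built, not a description of how the Microsoft viewer computes its links.  The centroid values (average age, average cars) and the threshold that stands in for the slider are invented.

```python
import math
from itertools import combinations

# Hypothetical cluster centers: (average age, average number of cars).
centroids = {
    "Cluster 1": (52.0, 2.0),
    "Cluster 2": (24.0, 0.3),
    "Cluster 3": (47.0, 2.1),
}


def link_strengths(centroids):
    """Score each pair of clusters from 0 (farthest pair) up toward 1 (closest)."""
    distances = {
        pair: math.dist(centroids[pair[0]], centroids[pair[1]])
        for pair in combinations(sorted(centroids), 2)
    }
    longest = max(distances.values())
    if longest == 0:
        # Every center coincides, so every link is as strong as possible.
        return {pair: 1.0 for pair in distances}
    return {pair: 1 - d / longest for pair, d in distances.items()}


strengths = link_strengths(centroids)

# The "slider": only draw links at least this strong.
threshold = 0.5
visible = {pair: s for pair, s in strengths.items() if s >= threshold}
print(visible)
```

Raising the threshold hides the weaker links first, which is the behavior we see when we drag the slider.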
Cluster Diagram (Links and Population)
The shading variable is Population.  This means that the darkness of the cluster corresponds to the number of rows within it.  You can hover your cursor over any cluster to see its size.  You can also change the shading value to see how certain values are distributed across clusters.  This is neat in some ways, but we prefer the Cluster Profiles for this type of analysis.
Cluster Profiles
We've created clusters in a number of programs.  We've even used Tableau in all of its splendor.  However, the Microsoft Cluster Profiles view is by far the best view we've ever seen for inspecting clusters.  Let's zoom in on a couple of pieces so we can see some things.
Cluster Profiles Zoom
In this view, each column is a cluster and each row is a variable.  The first column is actually the entire population.  They even do us the favor of naming it "Population".  However, the real attraction here is the comparison of clusters.  With this view, we can scan across a single row to see how that variable distinguishes the clusters.  For instance, Cluster 2 has mostly young people, while Clusters 1 and 3 are older.  Cluster 2 has quite a few people with zero cars, while Clusters 1 and 3 have mostly people with 2 cars.  We can repeat this process for all of the clusters and all of the variables to intuitively name our clusters if we choose.  For instance, we could call Cluster 2 "Young Adults with No Cars".  However, it gets slightly more complex because there are quite a few other variables here we have to look at as well.  If your data has too many variables, you can use the parameters to force the algorithm to choose only the most distinct variables.
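The same column-per-cluster comparison can be sketched in a few lines of Python.  This is only an illustration of the idea behind the view, assuming each row already carries a cluster label.  The car counts below are invented and are not taken from the sample data set.

```python
from collections import Counter, defaultdict

# Hypothetical cluster assignments: number of cars owned by each member.
assignments = {
    "Cluster 1": [2, 2, 1],
    "Cluster 2": [0, 0, 0, 1],
    "Cluster 3": [2, 2, 2],
}
rows = [
    {"cluster": name, "cars": cars}
    for name, values in assignments.items()
    for cars in values
]


def profile(rows, variable):
    """Share of each value of `variable` per cluster, plus a Population column."""
    counts = defaultdict(Counter)
    for row in rows:
        counts["Population"][row[variable]] += 1    # every row counts here
        counts[row["cluster"]][row[variable]] += 1  # and in its own cluster
    return {
        name: {value: n / sum(c.values()) for value, n in c.items()}
        for name, c in counts.items()
    }


cars_profile = profile(rows, "cars")
for name, shares in cars_profile.items():
    print(name, shares)
```

Reading across the output for one variable is the same exercise as scanning one row of the Cluster Profiles view: a value that is much more common in one cluster than in the Population column is a good candidate for that cluster's name.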

This is a great place to stop for today.  We've seen that the Clustering algorithm is a great way to get a good feeling for what's in your data, without having to do any manual investigation.  Stay tuned for the next post where we'll continue with our analysis and maybe even make some alterations.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

1 comment:

  1. Thank you for posting this. I have a question regarding clustering. Whenever I would input 100 in the "percentage of data for testing", the result does not give me the total size of the population? How can I create a cluster for 100% of the population?