Breaking BI: Data Mining in Excel Part 19: More Clustering

Today, we're going to continue our analysis of the Clustering algorithm.

Clustering

If you haven't read the previous post, we highly recommend you read it because this post picks up right where that one left off. We just finished looking at the Cluster Profiles. Let's move on to the Cluster Characteristics.

Cluster Characteristics

While the previous views allowed us to compare clusters to each other, this view allows us to see what makes up each cluster individually. This view is great for naming our clusters. We could select the first few variables and use those as names. So, we could call Cluster 1 "Older, Married, North American Home Owners with 2 Cars". That's a pretty good description of a group of people, and it took almost no effort to get. We could repeat this process for the remaining clusters if we wanted to. But, let's move to the final view, Cluster Discrimination.

Cluster Discrimination (1 vs. 2)

This view is very similar to what we just saw. However, this view shows us what's important in each cluster, as compared to another cluster. For instance, we can see what really distinguishes Cluster 1 from Cluster 2. Imagine you had a cluster that buys your products and another that doesn't. We could use this view to come up with a list of candidate attributes that may have an impact on customers' buying habits. We can even use this view to compare a cluster to everything that isn't in that cluster. This is known as a complement.

Cluster Discrimination (1 vs. Not 1)

This view is perfect for determining which attributes uniquely define a cluster. You might be wondering how this differs from the Cluster Characteristics. The Cluster Characteristics view shows you what's in a cluster. The Cluster Discrimination view shows you what's in a cluster that's NOT in other clusters. So, you could use this view to develop unique names for your clusters.
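To make the difference concrete, here is a minimal Python sketch outside of Excel. The rows, attribute names, and the simple "share inside minus share outside" score are all our own assumptions for illustration; this is not how the add-in actually scores attributes.

```python
from collections import Counter

# Hypothetical toy rows: (cluster, {attribute: value}). Not the post's data.
rows = [
    (1, {"Home Owner": "Yes", "Cars": "2"}),
    (1, {"Home Owner": "Yes", "Cars": "2"}),
    (1, {"Home Owner": "Yes", "Cars": "1"}),
    (2, {"Home Owner": "Yes", "Cars": "0"}),
    (2, {"Home Owner": "Yes", "Cars": "0"}),
    (2, {"Home Owner": "No", "Cars": "1"}),
]

def frequencies(selected):
    """Share of the selected rows that have each (attribute, value) pair."""
    counts = Counter(pair for _, attrs in selected for pair in attrs.items())
    return {pair: n / len(selected) for pair, n in counts.items()}

def characteristics(rows, cluster):
    """What is common inside the cluster, ignoring every other cluster."""
    return frequencies([r for r in rows if r[0] == cluster])

def discrimination(rows, cluster):
    """Share inside the cluster minus share in its complement (everything else)."""
    inside = frequencies([r for r in rows if r[0] == cluster])
    outside = frequencies([r for r in rows if r[0] != cluster])
    pairs = set(inside) | set(outside)
    return {p: inside.get(p, 0.0) - outside.get(p, 0.0) for p in pairs}

char_1 = characteristics(rows, 1)
disc_1 = discrimination(rows, 1)
print("Top characteristic:", max(char_1, key=char_1.get))
print("Top discriminator: ", max(disc_1, key=disc_1.get))
```

In this toy data, everyone in Cluster 1 owns a home, so that is the top characteristic. Most of Cluster 2 owns a home too, though, so the top discriminator is having 2 cars, which appears only in Cluster 1.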

Wait a minute! That's the third view in a row that we could use to name our clusters. Which view should you use? That's up to personal choice and how the names will be used. If you want a 1-stop shop for all of your information in a graphical format, the Cluster Profiles view is a great place to start. It also looks nice if you ever present your results. If you want to let the algorithm determine which features are important for your naming convention, then use the Cluster Characteristics or the Cluster Discrimination view. Personally, we think the Cluster Discrimination view is the most statistically sound way to do it. In the end, the choice is yours.

In Statistics, there's a concept called "Robustness". Basically, a robust model doesn't change very much if you try to tweak it. Robustness is a very good thing that every model should have. Imagine that you're a baseball coach. Would you rather have a pitcher that can play well in all conditions, or a pitcher that can only play well when the sun's out, the temperature is 75 degrees and he's facing West? It's pretty obvious; you want consistency in your pitchers and in your statistical models. So, how do we make sure that our model is robust? Let's check out the parameters.

Parameters

The first parameter we should notice is Cluster Seed. This parameter sets the seed number the clustering algorithm uses to randomly generate its initial clusters. If you try a few different values here and the clusters don't change much, then the model is pretty robust.
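To see what a seed does in general, here is a small Python sketch of a seed check. It uses a tiny one-dimensional k-means that we wrote for illustration, with made-up ages; it is an assumption-laden stand-in, not the algorithm the add-in runs. The seed only controls which points start out as cluster centers.

```python
import random

def kmeans_1d(values, k, seed, max_iter=100):
    """Tiny 1-D k-means; `seed` only controls which points start as centers."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(max_iter):
        # Assign every value to its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)

# Hypothetical ages forming two obvious groups.
ages = [22, 24, 25, 27, 58, 60, 63, 65]

for seed in (0, 1, 2, 3):
    print(seed, kmeans_1d(ages, k=2, seed=seed))
```

With groups this well separated, every seed lands on the same two centers (24.5 and 61.5), which is what a robust result looks like. On messier data, different seeds can settle on different clusters, and that is the warning sign to look for.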

The second parameter we should notice is Clustering_Method. This parameter determines which of four different clustering algorithms gets used to create the clusters. The primary methods are 1 (E-M) and 3 (K-Means). If you change this parameter and the clusters don't change much, then the model is pretty robust.
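The add-in views covered so far don't give us a number for "don't change much". Purely as an outside illustration, one generic way to compare two sets of cluster assignments is the Rand index: the share of row pairs that both runs treat the same way (together in both, or apart in both). The labels below are hypothetical, and nothing here comes from the add-in itself.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of row pairs on which two clusterings agree (same vs. different cluster)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same rows")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical cluster assignments for six rows from two runs of a model.
run_em = [1, 1, 1, 2, 2, 2]
run_kmeans = [2, 2, 2, 1, 1, 3]

print(rand_index(run_em, run_em))      # identical runs agree on every pair
print(rand_index(run_em, run_kmeans))  # cluster numbers differ, groupings mostly match
```

Notice that the cluster numbers themselves don't matter, only which rows end up together. Here the two runs agree on 13 of the 15 pairs, so the score is about 0.87; a score near 1 suggests the clusters barely changed.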

The question is "How do we know if the clusters changed?" Unfortunately, we're not that far along yet. We're still looking at the algorithms. Have no fear, we'll soon start talking about how to take these models and get tangible results out of them. Keep an eye out for the next post where we'll be talking about Associations. Thanks for reading. We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit
llewellyn.wb@gmail.com

http://www.linkedin.com/in/bradllewellyn

Monday, September 8, 2014
