Monday, December 9, 2013

Performing K-Means Clustering in Tableau

Today, we will talk about performing K-Means Clustering in Tableau.  In layman's terms, K-Means clustering attempts to group your data based on how close they are to each other.  If you want a rudimentary idea of how it looks, check out this picture.  Avid readers of this blog will notice that some of our previous posts, such as Simple Linear Regression and Z Tests, attempted to bring some hardcore statistical analysis into Tableau.  These posts required an extensive knowledge of statistics and Tableau.  Now, with Tableau 8.1's R integration, we can do even cooler stuff than that.  For those of you that don't know, R is a free statistical software package that is heavily used by Academic and Industry statisticians.  The inspiration for this post came from a post on the Tableau website.  You can read it here.

Microsoft's "Data Mining with SQL Server 2008" gave a perfect analogy for why you would want to use a clustering algorithm.  Feel free to read it here.  With this in mind, imagine that you have a bunch of demographic data about your customers, age, income, number of cars, etc.  Now, you want to find out what groups of customers you really have.  Take a look at some sample data we mocked up.
Sample Data
First, we need to examine our attributes.  In order to do this, we need to understand the difference between discrete and continuous attributes.  Simply put, a discrete attribute has a natural ordering, yet can only take a small number of distinct values.  Number of Children is a great example of this.  The values are numeric, which gives them a distinct ordering and there's a natural cap to how high this number can be.  It's extremely uncommon to see anyone with more than six children or so.  On the other hand, a continuous attribute can still be ordered, but takes far too many values for you to be able to list them all.  Income is a great example of this.  If you lined up all of your friends in a room, it's extremely unlikely that any of you would make the same exact amount of money.  An easy way to distinguish between these is to ask yourself, "Could I make a pie chart out of this attribute?"  If you answered yes, then the attribute is discrete.  If you answered no, then it is continuous.  Please note that this a gross oversimplification of these definitions, but they are good enough for this post.  Feel free to google them if you want to know more.

Now, let's see kinds of attributes we have.

Customer ID:           Unique Key
Age:                        Continuous
Education:                Discrete
Gender:                    Discrete*
Number of Cars:      Discrete
Number of Children: Discrete
Yearly Income:         Continuous

*Gender is technically a categorical attribute.  We'll touch back on this later.

Another important thing to note is that the K-Means Algorithm in R requires numeric input.  Therefore, we had to replace Education and Gender with numeric IDs.  Partial High School is a 1 and Doctorate Degree is a 6, with everything in the middle ordered appropriately.  Female is 0 and Male is 1.  This would have been an issue if we had a categorical attribute with more than two levels.

Now, we should also note that the R scripting functions qualify as Table Calculations.  Therefore, you need to set up your canvas before you can send the appropriate values to R.  Let's start by setting Customer ID on the Detail Shelf.  This defines the granularity of the chart.
Customer ID
Now, we need to create the calculated field that will create the clusters.  This code is commented (comments in R start with #) so that you can read it more easily.

SCRIPT_INT("
    ## Sets the seed

    set.seed( .arg8[1] )

    ## Studentizes the variables

    age <- ( .arg1 - mean(.arg1) ) / sd(.arg1)
    edu <- ( .arg2 - mean(.arg2) ) / sd(.arg2)
    gen <- ( .arg3 - mean(.arg3) ) / sd(.arg3)
    car <- ( .arg4 - mean(.arg4) ) / sd(.arg4)
    chi <- ( .arg5 - mean(.arg5) ) / sd(.arg5)
    inc <- ( .arg6 - mean(.arg6) ) / sd(.arg6)
    dat <- cbind(age, edu, gen, car, chi, inc)

    num <- .arg7[1]

    ## Creates the clusters

    kmeans(dat, num)$cluster
",    

MAX( [Age] ), MAX( [Education ID] ), MAX( [Gender ID] ),
MAX( [Number of Cars] ), MAX( [Number of Children] ), MAX( [Yearly Income] ),
[Number of Clusters], [Seed]
)

Basically, this code sends our six attributes to R and performs the clustering.  We also passed two parameters into this code.  First, we made a parameter that can change the number of clusters.  Second, we made a parameter that sets the seed.  A seed is what determines the output from a "random" number generator.  Therefore, if we set a constant seed, then we won't get different clusters every time we run this.  This is EXTREMELY important to what we are about to do.

Now, we want to examine our clusters on an attribute-by-attribute basis so that we can determine what our clusters represent. In order to do this, we made the following chart:
Clusters (Seed 500)
This chart is nothing more than a bunch of Shape charts, with the Transparency set to 0 so that we can't see the shapes.  Then, we put -1,+1 standard deviation shading references and an average reference line on each chart.  Next, our goal is to look at each chart to see how the clusters differ.  We will use very rough estimation when we look at these charts.  Don't beat yourself up over the tiny details of which box is bigger or which line is higher; this isn't an exact science.

On the Yearly Income chart, we see that Clusters 1 and 3 are pretty close, and Cluster 2 is much higher.  So, we'll say that Cluster 2 is "Wealthy."

On the Age chart, we don't see a significant difference between any of the Clusters.  So, we move on.

On the Education ID chart, we see that Clusters 1 and 3 are pretty close again, and Cluster 2 is much higher.  So, we'll call Cluster 2 "Educated."  Shocking Surprise!  Educated People seem to make more money.  It's almost like we built the data to look like this.  Anyway, moving on.

On the Gender ID chart, we see that Cluster 1 is almost entirely female and Cluster 3 is almost entirely male.

On the Number of Cars chart, we don't see a significant difference between the clusters.

On the Number of Children chart, we see that Cluster 3 has more children than the other clusters.  So, we'll call this Cluster "Lots of Children".

Now, let's recap our clustering:

Cluster 1: "Male."  We'll call this cluster the "Average Males"
Cluster 2: "Wealthy", "Educated."  We'll call this cluster "Wealthy and Educated"
Cluster 3: "Female", "Lots of Children."  We'll call this cluster "Females with Lots of Children"

Now, before anybody cries sexism at us, we will say that we intentionally created a relationship between income and education.  However, the fact that gender and number of children were clustered together were purely random chance.

This leads us to another question.  What if you don't like the clusters you got?  What if they weren't very distinguishable or you thought that the random number generator messed up the clusters.  Easy!  Just change the seed and/or the number of clusters.
Clusters (Seed 1000)
Changing the seed to 1000 completely changed our clusters.  This is what's so cool about statistics.  There are no right answers.  Everything is up for interpretation.

Now, you might ask, "If the clusters change every time, why is this even useful?"  That's the singular defining question behind statistics and the answer is typically, "Run it more than once."  If you create 10 sets of clusters and 9 of them pair High Income with High Education, then that's a VERY good indication that your data contains that cluster.  However, if you run it 10 times and find that half of the time it groups Men with High Income and the other half of the time it groups Women with High Income, then that probably means there is not a very strong relationship between Gender and Income.

We're sorry that we couldn't explain in more depth how the R functionality or the clustering algorithm works.  It would probably take an entire book to fully explain it.  Feel free to do some independent research on how clustering algorithms work and how to interpret the results.  We encourage you to experiment with this awesome new feature if you get a chance.  If you find a different, and maybe even better, way of doing this, let us know.  We hope you found this informative.  Thanks for reading.

P.S.

We would have much rather represented our discrete attributes, or categorical in the case of Gender, using some type of bar graph.  However, we were completely unable to find a way to get it to work.  If you know of a way, please let us know in the comments.

Brad Llewellyn
Associate Data Analytics Consultant
Mariner, LLC
brad.llewellyn@mariner-usa.com
http://www.linkedin.com/in/bradllewellyn
http://breaking-bi.blogspot.com

15 comments:

  1. Hey Brad,

    Great post!

    I am trying to replicate this to identify members of a particular cluster. Although I am receiving an error within the kmeans function....

    Error in sample.int(m, k) :
    cannot take a sample larger than the population when 'replace = FALSE'

    My cluster calculation is below:

    SCRIPT_INT(
    "
    ## Sets the seed

    set.seed( .arg5[1] )

    ## Studentizes the variables

    fte <- ( .arg1 - mean(.arg1) ) / sd(.arg1)
    fr <- ( .arg2 - mean(.arg2) ) / sd(.arg2)
    below <- ( .arg3 - mean(.arg3) ) / sd(.arg3)
    dat <- cbind(fte, fr, below)

    num <- .arg4[1]

    ## Creates the clusters

    kmeans(dat, num)$cluster
    ",

    MAX( [Special Ed FTE Fixed] ), MAX( [FR Status Fixed] ), MAX( [Below Standard] ), [Number of Clusters], [Seed]

    )


    Any ideas on why I would receive this error?

    Thanks for your time

    ReplyDelete
  2. Brad,

    All I needed to do was turn off "Aggregate Measures" under Analysis

    ReplyDelete
    Replies
    1. Thanks for commenting! Glad you figured out the issue.

      Delete
  3. Hi Brad,
    How do you extract the clusters in table form once you use the R kmeans on tableau?
    Thanks

    ReplyDelete
    Replies
    1. Thanks for commenting! I'm not sure I understand your question. If you try using "Duplicate Sheet as Crosstab", you might find an answer. Does this make sense or am i misunderstanding?

      Delete
  4. Hi Sir,

    Im a highschool student and very new to R-script and Tableau, this is a question coming from a novice, I am trying to perform k-mean clustering and my script is

    SCRIPT_INT("
    ## Sets the seed

    set.seed( .arg6[1] )

    ## Studentizes the variables

    CP <- ( .arg1 - mean(.arg1) ) / sd(.arg1)
    Emp <- ( .arg2 - mean(.arg2) ) / sd(.arg2)
    GVA <- ( .arg3 - mean(.arg3) ) / sd(.arg3)
    Surv <- ( .arg4 - mean(.arg4) ) / sd(.arg4)
    dat <- cbind(CP, Emp, GVA, Surv)

    num <- .arg5[1]

    ## Creates the clusters

    kmeans(dat, num)$cluster
    ",

    MAX([CP]), MAX([Emp]), MAX([GVA]),
    MAX( [Surv] ),
    [Number of Clusters],[Seed]
    )

    I got an error saying:
    Reference to undefined field [Number of Cluster].
    Reference to undefined field [seeds].

    got any advice where or how I got this problem? Thanks so much for the help.

    ReplyDelete
    Replies
    1. In my workbook, I create a parameter for Number of Clusters and Seed. Therefore, I was able to use them in the calculation. Did you also create these parameters?

      Delete
    2. Thank you so much for the advice sir, I have an extra letter in the parameter section Number of Clusters instead of Number of Cluster, seedd instead of seeds. R language is really quite a challenge to master.

      Delete
  5. Thank you for this very interesting post.

    I'm also working on a Tableau + R integration using the k-means model.

    I'm trying to bring back the centers to the view as I would normaly do in R using: points(cl$centers, pch = 17, cex=2). But since Tableau only allows a vector of the same lenght to be brought back, I created two separated fields for the X and Y components:

    SCRIPT_REAL('
    set.seed(1234)
    param <- max(.arg3)
    result <- kmeans(x = data.frame(.arg1,.arg2), param)
    df <- data.frame(.arg1,.arg2)
    df2 <- cbind(df, result$cluster)
    colnames(df2)[3] <- "Cluster"
    df3 <- cbind(result$centers, c(1:param))
    colnames(df3)[3] <- "Cluster"
    df4 <- merge(df2, df3, by="Cluster")
    df4[,4]',
    SUM([Petal#Length]),SUM([Petal#Width]),[Parameter 1])

    Now if I would like to only return the distinct values for the centers and plot them but I can't use

    IF FIRST()==0 THEN WINDOW_SUM(COUNTD([X Centers])) END

    Because [X Centers] is already an aggregated field. Any idea on how I should procede ?

    ReplyDelete
    Replies
    1. Steven,

      Thanks for commenting! I think you're already there. If the vector you return to Tableau is the [X Center], but it is duplicated, then you could place your [X Centers] field on the Columns Shelf, with your arguments on the details shelf. This should create a 1-dimensional scatterplot with duplicate value on each [X Center]. Then, use the filter

      FIRST() == 0

      with Compute Using set to [Cluster] to remove all of the duplicates. This method requires that you also return the cluster number to Tableau, which shouldn't be an issue seeing that you've already computed it in your R code. Does this help?

      Delete
  6. I wanted to recreate the example but couldn't find the data. Could you please add the link (or am I missing something?)

    Thank you

    ReplyDelete
  7. Hi,
    I tried standardizing the data as mentioned in your post but I am getting NA/NAN error while running kmean clustering for a demographic dataset. I want to standardize the data since my parameters have different units. The problem is similar to the one that you have mentioned in your post.

    Below is the kmeans cluster code that I am using for normalization

    age <- ( .arg1 - mean(.arg1) ) / sd(.arg1)
    income <- ( .arg2 - mean(.arg2) ) / sd(.arg2)
    experience <- ( .arg3 - mean(.arg3) ) / sd(.arg3)

    ReplyDelete
  8. Also I want to understand the use of aggregate functions like SUM,MAX,MIN in SCRIPT functions as I have seen different codes with different functions.It will be great if you could explain the reason for selecting MAX function in your example.

    ReplyDelete
  9. Hi, i am new to statistics and Tableau. I am not quite getting the SD shading right - if you recall, did you leave the defaults? my shading is covering the entire range - did you choose scope of Per Pane? did you pick sample or population?

    ReplyDelete
  10. Hi, when I tried to run the below code and in spite of turning off "Aggregate Measures" under Analysis, i met with the error "Error in sample.int(m, k) :
    cannot take a sample larger than the population when 'replace = FALSE'.


    SCRIPT_INT("
    ## Sets the seed

    set.seed(.arg6[1])

    ## Studentizes the variables
    Overdue_Amount <- (.arg1 - mean(.arg1)) / sd(.arg1)
    Days_Late_paid <- (.arg2 - mean(.arg2) ) / sd(.arg2)
    Credit_Limit <- (.arg3 - mean(.arg3) ) / sd(.arg3)
    DSO_Days <- (.arg4 - mean(.arg4)) / sd(.arg4)
    dat <- cbind(Overdue_Amount, Days_Late_paid, Credit_Limit, DSO_Days)
    num <- .arg5[1]

    ## Creates the clusters
    kmeans(dat, num)$cluster
    ",

    MAX( [A Total Overdue Open Inv Amt] ), MAX( [Avg DL Paid Invs] ), MAX( [Cash Cus Credit Limit] ),
    MAX( [Cash Cus Dso Days]),
    [Number of Clusters], [Seed]
    )

    Can any one be of some help?

    ReplyDelete