Comments on Breaking BI: Performing K-Means Clustering in Tableau

Hi, when I tried to run the below code and in spit...

2016-07-28T07:22:37.253-04:00

Hi, when I tried to run the code below, I got the following error even though I had turned off "Aggregate Measures" under Analysis:

Error in sample.int(m, k) :
cannot take a sample larger than the population when 'replace = FALSE'

SCRIPT_INT("
## Sets the seed

set.seed(.arg6[1])

## Studentizes the variables
Overdue_Amount <- (.arg1 - mean(.arg1)) / sd(.arg1)
Days_Late_paid <- (.arg2 - mean(.arg2) ) / sd(.arg2)
Credit_Limit <- (.arg3 - mean(.arg3) ) / sd(.arg3)
DSO_Days <- (.arg4 - mean(.arg4)) / sd(.arg4)
dat <- cbind(Overdue_Amount, Days_Late_paid, Credit_Limit, DSO_Days)
num <- .arg5[1]

## Creates the clusters
kmeans(dat, num)$cluster
",

MAX( [A Total Overdue Open Inv Amt] ), MAX( [Avg DL Paid Invs] ), MAX( [Cash Cus Credit Limit] ),
MAX( [Cash Cus Dso Days]),
[Number of Clusters], [Seed]
)

Can anyone help?
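For readers who hit the same message: kmeans() picks its starting centers by sampling rows without replacement, so it fails this way when fewer rows reach R than the number of clusters requested. The sketch below runs in plain R, outside Tableau, and uses made-up numbers. The helper check_then_cluster is hypothetical and not part of the original post; it only makes the failure easier to read. It does not say why too few rows would reach R in a particular workbook.

```r
## Hypothetical helper (not from the original post): report how many rows
## reached R before kmeans() tries to sample its starting centers.
check_then_cluster <- function(dat, num) {
  dat <- as.matrix(dat)
  if (nrow(dat) < num) {
    stop(sprintf("only %d row(s) reached R, but %d clusters were requested",
                 nrow(dat), num))
  }
  kmeans(dat, num)$cluster
}

## A single row with three clusters requested: kmeans() stops with an error.
one_row <- matrix(c(1, 2, 3, 4), nrow = 1)
msg_raw <- tryCatch(kmeans(one_row, 3),
                    error = function(e) conditionMessage(e))
print(msg_raw)

## The same input through the helper stops earlier, with the row count.
msg_checked <- tryCatch(check_then_cluster(one_row, 3),
                        error = function(e) conditionMessage(e))
print(msg_checked)

## With enough rows the helper returns one cluster label per row.
set.seed(1)
labels <- check_then_cluster(matrix(rnorm(40), ncol = 4), 3)
```

If the checked message reports a single row, the measures are probably still being collapsed to one mark (or one partition) before the script runs.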

Hi, i am new to statistics and Tableau. I am not ...

2016-07-15T15:59:31.987-04:00

Hi, I am new to statistics and Tableau. I am not quite getting the SD shading right. If you recall, did you leave the defaults? My shading is covering the entire range. Did you choose a scope of Per Pane? Did you pick sample or population?

Also I want to understand the use of aggregate fun...

2015-07-30T21:39:44.478-04:00

Also, I want to understand the use of aggregate functions like SUM, MAX, and MIN in SCRIPT functions, as I have seen different code with different functions. It would be great if you could explain the reason for selecting the MAX function in your example.
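Some background that may help with this question: Tableau's SCRIPT functions are table calculations, so each argument has to be an aggregate expression. When every mark in the view corresponds to exactly one row of data, MAX, MIN, and SUM all return that one value, so the choice makes no difference. It only matters when a mark covers several rows. The plain-R sketch below illustrates this with made-up numbers, using tapply() as a stand-in for Tableau's per-mark aggregation. It does not speak for the author's own reason for choosing MAX.

```r
## One row per customer: every aggregate returns the same value.
limit    <- c(100, 250, 80)
customer <- c("a", "b", "c")
one_row_same <- all(tapply(limit, customer, max) == tapply(limit, customer, min)) &&
                all(tapply(limit, customer, max) == tapply(limit, customer, sum))

## Two rows per customer: the aggregates now disagree.
limit2    <- c(100, 250, 80, 120, 300, 90)
customer2 <- rep(c("a", "b", "c"), times = 2)
max_by_customer <- tapply(limit2, customer2, max)
sum_by_customer <- tapply(limit2, customer2, sum)
two_rows_same <- all(max_by_customer == sum_by_customer)
```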

Hi, I tried standardizing the data as mentioned i...

2015-07-30T21:38:41.657-04:00

Hi,
I tried standardizing the data as mentioned in your post, but I am getting an NA/NaN error while running k-means clustering on a demographic dataset. I want to standardize the data since my parameters have different units. The problem is similar to the one that you mentioned in your post.

Below is the k-means cluster code that I am using for normalization:

age <- ( .arg1 - mean(.arg1) ) / sd(.arg1)
income <- ( .arg2 - mean(.arg2) ) / sd(.arg2)
experience <- ( .arg3 - mean(.arg3) ) / sd(.arg3)
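Two common ways for this standardization to produce NA or NaN are a missing value in the input, which makes mean() and sd() return NA for the whole column, and a column with no spread, which divides zero by zero. The sketch below shows both in plain R with made-up data. The helper standardize is hypothetical, not from the original post, and giving incomplete rows an NA cluster label is one possible choice rather than the author's method. Whether either cause applies to the dataset in this comment is an assumption.

```r
## Hypothetical helper: z-score a vector while tolerating missing values
## and columns with no spread.
standardize <- function(v) {
  s <- sd(v, na.rm = TRUE)
  if (is.na(s) || s == 0) {
    out <- rep(0, length(v))   # no spread: use 0 instead of dividing by zero
    out[is.na(v)] <- NA        # keep missing values missing
    return(out)
  }
  (v - mean(v, na.rm = TRUE)) / s
}

age    <- c(25, 40, NA, 31, 52)
income <- c(30, 55, 41, 38, 90)

naive_age <- (age - mean(age)) / sd(age)   # one NA input turns every element NA
safe_age  <- standardize(age)              # only the missing row stays NA

constant       <- c(5, 5, 5)
naive_constant <- (constant - mean(constant)) / sd(constant)  # 0 / 0 gives NaN
safe_constant  <- standardize(constant)

## kmeans() rejects NA rows, so cluster only the complete rows and keep
## the result the same length as the input.
dat <- cbind(safe_age, standardize(income))
ok  <- complete.cases(dat)
set.seed(1)
cluster <- rep(NA_integer_, nrow(dat))
cluster[ok] <- kmeans(dat[ok, , drop = FALSE], 2)$cluster
```

The result keeps one entry per input row, which a Tableau script needs. How Tableau displays the NA entry (presumably as Null) is worth checking in your own workbook.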

I wanted to recreate the example but couldn't ...

2015-06-23T12:51:04.119-04:00

I wanted to recreate the example but couldn't find the data. Could you please add the link (or am I missing something?)

Thank you

Steven, Thanks for commenting! I think you'r...

2014-11-10T09:12:19.007-05:00

Steven,

Thanks for commenting! I think you're already there. If the vector you return to Tableau is the [X Center], but it is duplicated, then you could place your [X Centers] field on the Columns shelf, with your arguments on the Detail shelf. This should create a one-dimensional scatterplot with duplicate values on each [X Center]. Then, use the filter

FIRST() == 0

with Compute Using set to [Cluster] to remove all of the duplicates. This method requires that you also return the cluster number to Tableau, which shouldn't be an issue seeing that you've already computed it in your R code. Does this help?
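The per-row center vector discussed in this exchange can also be built without merge(), by indexing the centers matrix with the cluster labels. This is a minimal sketch in plain R, assuming the built-in iris data stands in for the Petal fields in the comment below; the variable names are illustrative. One reason to prefer indexing: merge() sorts its result by the key column by default, so values taken from a merged data frame may no longer line up with the order of the rows that were passed in.

```r
set.seed(1234)
x <- iris$Petal.Length
y <- iris$Petal.Width
k <- 3   # stands in for the Tableau parameter

fit <- kmeans(data.frame(x, y), k)

## Row i of the data belongs to cluster fit$cluster[i], so indexing the
## centers matrix by that vector repeats each center once per member row,
## in the original row order.
x_center <- unname(fit$centers[fit$cluster, 1])
y_center <- unname(fit$centers[fit$cluster, 2])
```

Either vector has one value per input row, so it can be the last expression of a SCRIPT_REAL call, and the FIRST() == 0 filter described above then keeps one mark per cluster.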

Thank you for this very interesting post. I'm...

2014-11-10T00:01:21.750-05:00

Thank you for this very interesting post.

I'm also working on a Tableau + R integration using the k-means model.

I'm trying to bring the centers back to the view as I would normally do in R using: points(cl$centers, pch = 17, cex=2). But since Tableau only allows a vector of the same length to be brought back, I created two separate fields for the X and Y components:

SCRIPT_REAL('
set.seed(1234)
param <- max(.arg3)
result <- kmeans(x = data.frame(.arg1,.arg2), param)
df <- data.frame(.arg1,.arg2)
df2 <- cbind(df, result$cluster)
colnames(df2)[3] <- "Cluster"
df3 <- cbind(result$centers, c(1:param))
colnames(df3)[3] <- "Cluster"
df4 <- merge(df2, df3, by="Cluster")
df4[,4]',
SUM([Petal#Length]),SUM([Petal#Width]),[Parameter 1])

Now I would like to return only the distinct values for the centers and plot them, but I can't use

IF FIRST()==0 THEN WINDOW_SUM(COUNTD([X Centers])) END

because [X Centers] is already an aggregated field. Any idea how I should proceed?

Thank you so much for the advice sir, I have an ex...

2014-05-12T05:56:34.827-04:00

Thank you so much for the advice, sir. I had an extra letter in the parameter section: Number of Clusters instead of Number of Cluster, and seedd instead of seeds. The R language is really quite a challenge to master.

In my workbook, I create a parameter for Number of...

2014-05-11T14:27:18.385-04:00

In my workbook, I created parameters for Number of Clusters and Seed. Therefore, I was able to use them in the calculation. Did you also create these parameters?

Hi Sir, Im a highschool student and very new to R...

2014-05-11T08:20:55.776-04:00

Hi Sir,

I'm a high school student and very new to R scripts and Tableau, so this is a question from a novice. I am trying to perform k-means clustering, and my script is

SCRIPT_INT("
## Sets the seed

set.seed( .arg6[1] )

## Studentizes the variables

CP <- ( .arg1 - mean(.arg1) ) / sd(.arg1)
Emp <- ( .arg2 - mean(.arg2) ) / sd(.arg2)
GVA <- ( .arg3 - mean(.arg3) ) / sd(.arg3)
Surv <- ( .arg4 - mean(.arg4) ) / sd(.arg4)
dat <- cbind(CP, Emp, GVA, Surv)

num <- .arg5[1]

## Creates the clusters

kmeans(dat, num)$cluster
",

MAX([CP]), MAX([Emp]), MAX([GVA]),
MAX( [Surv] ),
[Number of Clusters],[Seed]
)

I got an error saying:
Reference to undefined field [Number of Cluster].
Reference to undefined field [seeds].

Do you have any advice on where or how I introduced this problem? Thanks so much for the help.

Thanks for commenting! I'm not sure I underst...

2014-03-24T19:33:44.039-04:00

Thanks for commenting! I'm not sure I understand your question. If you try using "Duplicate Sheet as Crosstab", you might find an answer. Does this make sense, or am I misunderstanding?

Hi Brad, How do you extract the clusters in table ...

2014-03-24T17:08:47.841-04:00

Hi Brad,
How do you extract the clusters in table form once you use the R kmeans function in Tableau?
Thanks

Thanks for commenting! Glad you figured out the i...

2014-01-16T09:18:00.768-05:00

Thanks for commenting! Glad you figured out the issue.

Brad, All I needed to do was turn off "Aggre...

2014-01-15T20:02:23.214-05:00

Brad,

All I needed to do was turn off "Aggregate Measures" under Analysis.

Hey Brad, Great post! I am trying to replicate t...

2014-01-15T19:40:49.230-05:00

Hey Brad,

Great post!

I am trying to replicate this to identify members of a particular cluster. However, I am receiving an error within the kmeans function:

Error in sample.int(m, k) :
cannot take a sample larger than the population when 'replace = FALSE'

My cluster calculation is below:

SCRIPT_INT(
"
## Sets the seed

set.seed( .arg5[1] )

## Studentizes the variables

fte <- ( .arg1 - mean(.arg1) ) / sd(.arg1)
fr <- ( .arg2 - mean(.arg2) ) / sd(.arg2)
below <- ( .arg3 - mean(.arg3) ) / sd(.arg3)
dat <- cbind(fte, fr, below)

num <- .arg4[1]

## Creates the clusters

kmeans(dat, num)$cluster
",

MAX( [Special Ed FTE Fixed] ), MAX( [FR Status Fixed] ), MAX( [Below Standard] ), [Number of Clusters], [Seed]

)

Any ideas on why I would receive this error?

Thanks for your time