Breaking BI: Conducting a 2-sample Z Test in Tableau

Wednesday, March 6, 2013

Yesterday, we posted about finding the median of a set of Likert values. This got us thinking that maybe basic statistics are possible in Tableau. As it turns out, they are. Today, we will look at how to conduct a 2-sample Z Test in Tableau. A 2-sample Z Test checks whether two samples come from normally distributed populations with the same mean. As usual, we used the Superstore Sales sample data set in Tableau.

EDIT: This instance is definitely a case of "foot-in-mouth" syndrome. Later in this post, I use the incorrect formula to calculate the P-value. However, the rest of the procedure is valid, just not the P-value. Even so, this is still a good example of showing the many uses of table calculations. Enjoy!

Step 1:

Determine what two sets of data you would like to compare

For our example, we chose profits for 2011 and 2012, rolled up to the customer level. We can see these distributions below.

Histograms

For those of you who know how to read normal histograms, you can probably see that these are VERY similar. Therefore, we should expect our test to return a high P-value (> .05).

Step 2:

Aggregate your data and divide it into your two samples.
Create the following calculated fields.

2011 Profit

2012 Profit
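For readers who want to sanity-check the roll-up outside Tableau, here is a minimal Python sketch of the same idea: sum profit per customer, once for each year. The rows and the `profit_by_customer` helper are made up for illustration; they are not the Superstore data or the calculated fields from this post.

```python
from collections import defaultdict

# Hypothetical order rows: (customer, order year, profit). Not Superstore data.
orders = [
    ("Ann", 2011, 120.0),
    ("Ann", 2011, -20.0),
    ("Ann", 2012, 75.0),
    ("Bob", 2011, 40.0),
    ("Bob", 2012, 10.0),
    ("Bob", 2012, 5.0),
    ("Cy", 2012, 60.0),
]

def profit_by_customer(rows, year):
    """Sum profit per customer for one year.

    Customers with no orders in that year are left out of the sample.
    """
    totals = defaultdict(float)
    for customer, order_year, profit in rows:
        if order_year == year:
            totals[customer] += profit
    return dict(totals)

# One sample per year, each rolled up to the customer level.
profit_2011 = profit_by_customer(orders, 2011)
profit_2012 = profit_by_customer(orders, 2012)
```

Each dictionary holds one value per customer, which is the level the rest of the test works at.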

Step 3:

Find the means and standard deviations of your two sets.
Create the following calculated fields.

2011 Profit Sample Mean

2012 Profit Sample Mean

2011 Profit Standard Deviation

2012 Profit Standard Deviation

There are two important things to note here. First, these are all window functions along [Customer]. This is because we already aggregated the data up to the [Customer] level, and Tableau only gives us one option for a secondary aggregation. Second, a discerning reader might say, "You're using sample standard deviations! Shouldn't these be t tests?" The answer is a resounding, "Yes!" However, to our knowledge, Tableau does not have access to the Gamma function (Google it if you care what it is). Without it, we can't calculate a P-value from the t distribution. So, we're stuck with a z test, which gives almost identical results given our large sample size.
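As a cross-check on the window calculations, the same summary statistics can be sketched in Python with the standard library. The per-customer values below are invented, and `statistics.stdev` is used because it is the sample (n - 1) standard deviation, which is what the note above refers to.

```python
import statistics

# Hypothetical per-customer profit totals for each year (illustrative only).
profit_2011 = [100.0, 40.0, 70.0, 10.0]
profit_2012 = [75.0, 15.0, 60.0, 50.0]

# Sample means across customers.
mean_2011 = statistics.mean(profit_2011)
mean_2012 = statistics.mean(profit_2012)

# Sample standard deviations (n - 1 in the denominator).
sd_2011 = statistics.stdev(profit_2011)
sd_2012 = statistics.stdev(profit_2012)

print(mean_2011, sd_2011)
print(mean_2012, sd_2012)
```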

Step 4:

Calculate your pooled standard deviation
Create the following calculated fields

2011 Customers

2012 Customers

Pooled Profit Standard Deviation

The first two calculated fields are distinct counts of the number of customers. We exploited the "Compute Along" feature to simplify the calculation to a simple sum of ones.
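The exact formula behind the pooled field isn't spelled out in the text, so the sketch below assumes the usual textbook pooled standard deviation, which weights each sample variance by its degrees of freedom. The function name `pooled_sd` and its arguments are just for illustration.

```python
import math

def pooled_sd(sd_1, n_1, sd_2, n_2):
    """Textbook pooled standard deviation (an assumption, not the post's field).

    sqrt(((n_1 - 1) * sd_1^2 + (n_2 - 1) * sd_2^2) / (n_1 + n_2 - 2))
    """
    weighted = (n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2
    return math.sqrt(weighted / (n_1 + n_2 - 2))

# With equal sample sizes, the pooled variance is the average of the two variances.
example = pooled_sd(2.0, 5, 4.0, 5)
```

When both samples have the same standard deviation, the pooled value is simply that standard deviation, which is a handy check on any implementation.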

Step 5:

Calculate the Z Score
Create the following calculated field

Z Score
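For reference, one common way to build the statistic from the pieces above is to divide the difference in sample means by the pooled standard deviation times sqrt(1/n1 + 1/n2). That form is an assumption here, since the calculated field itself isn't reproduced in the text, and `z_score` is a made-up name.

```python
import math

def z_score(mean_1, mean_2, pooled_sd, n_1, n_2):
    """Two-sample z statistic using a pooled standard deviation (assumed form)."""
    standard_error = pooled_sd * math.sqrt(1.0 / n_1 + 1.0 / n_2)
    return (mean_1 - mean_2) / standard_error

# Hypothetical inputs: means 55 and 50, pooled SD 10, 100 customers per year.
z = z_score(55.0, 50.0, 10.0, 100, 100)
```

A z near zero means the two sample means are close relative to their sampling noise, which is what we expect for distributions as similar as the two histograms.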

Step 6:

Calculate the P-value
Create the following calculated field

P-value

As you can see, this is a two-sided P-value; you can make it one-sided if you so choose. Now we can look at the final result.

EDIT: I had a big DUH! moment earlier today. The formula I used to calculate the P-value is incorrect. It should be an integral, which Tableau is incapable of calculating. In spite of this, the Z-statistic is still valid; you just have to use some other means to calculate a rejection region and/or a P-value. On the bright side, maybe some type of statistical package integration is in the works. Cheers!
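If you export the Z-statistic, the integral is easy to evaluate elsewhere. As a minimal sketch of "some other means", Python's standard library exposes the complementary error function, and the standard normal tail area can be written in terms of it. The function names below are illustrative.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2.0))

def one_sided_p_value(z):
    """Upper-tail P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# The familiar cutoff: |z| = 1.96 gives a two-sided P-value of about 0.05.
print(two_sided_p_value(1.96))
```

For a rejection region instead of a P-value, compare |z| to the critical value directly (1.96 for a two-sided test at the 5% level).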

Final Table

The one caveat is that you have to put [Customer] somewhere on the canvas in order for this method to work. We thought the Level of Detail shelf would be the easiest place. As we suspected, the P-value is extremely high, meaning that we don't have enough evidence to say that these two samples come from populations with different means.

WHEW! Wasn't that a lot of work? We think so too. It seems that we've shown that Tableau is capable of basic statistical testing. However, you are literally doing all of the work from scratch. It would probably be much easier to combine Tableau's graphical capabilities with a more specialized tool, such as R or SAS. On the bright side, there's an idea on the Tableau forums to put some sort of R integration in Tableau. Here's to hoping!

In all seriousness though, if you need advanced statistical procedures, you're probably better off using a specialized tool and exporting the values manually. I hope you found this informative. Thanks for reading.

Brad Llewellyn

Associate Consultant

Mariner, LLC

llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn

8 comments:

Unknown, June 20, 2013 at 8:39 AM
Brad,

How would I explain the use of these statistical tests to laymen? Can you please tell me a bit more about the line below?

As we suspected, the P-value is extremely high, meaning that we don't have enough evidence that say that these two samples come from populations with different means.
Unknown, June 21, 2013 at 2:25 AM
Yes. A bit clearer. Imagine I am new to the world of statistics and histograms.
How well do these support decision making, and should I use these kinds of graphs for decision making?

Why not comparative analysis, or my basic bar and line charts, or the 23 types of Tableau charts? I want to understand the core, basic use. Can you help?

I have studied a lot of statistics (P-values, the normal distribution, etc.). Where is it really used in real life, and how does it benefit clients? I would like to know from your experience.
Surya, June 12, 2014 at 9:59 AM
Brad, Thanks for the awesome blog. I was wondering if Tableau would allow us to do these tests at different levels with the same data set. Now, the data is at customer level and say, we add one more categorical variable like Location (North, South being two values).

So, I need Z statistic for the following pairs:
2011 Profit of North Vs 2012 Profit of North
2011 Profit of South Vs 2012 Profit of South
2011 Profit Vs 2012 Profit

*I'll have a drop down to select North and South.

Appreciate your help! Expecting a reply soon!

Thanks,
Surya

About Me

Brad is a Service Engineer on Microsoft's FastTrack for Azure Team in Charlotte, NC. Brad helps individuals and organizations leverage Analytics and Azure to revolutionize themselves and their industries. He has an M.S. in Statistics from the University of South Carolina, MCSE Certification in Data Management and Analytics, MCSE Certification in Cloud Platform and Infrastructure and various MCSA Certifications in Business Intelligence and Advanced Analytics. Brad is an active blogger at breaking-bi.blogspot.com. He is also an organizer for the Charlotte BI Group, a local PASS chapter in Charlotte, NC. You can connect with him on LinkedIn at https://www.linkedin.com/in/bradllewellyn and on Twitter @BreakingBI.