Monday, November 30, 2015

Level of Detail Calculations in Tableau Part 3: Distinct Counts

Today, we're going to talk about creating Distinct Counts in Tableau.  Primarily, we're going to focus on how to create them using Level of Detail (LoD) Calculations.  Then, we're going to compare LoDs to Table Calculations (TC) and Basic Calculations (BC) from the perspective of Ease of Creation, Flexibility and Performance.  We will not be covering in-depth how to create the Basic and Table Caclulation versions of the metrics.  COUNTD() is a built-in BC and you can read this post for the TC.  For these analyses, we will be using the Superstore Sales data set repeated 1000 times to create a data set that contains 10 million rows.

Metric 1: Total Distinct Rows

Let's start at the top level.  What if you wanted to know the Total Distinct Count of a field in your data set?
Distinct Rows (LoD)
Since we're looking for an overall Total, we want to use FIXED.  Our calculation in this case is COUNTD().  Let's check out the results.
Total Distinct Rows (LoD)
We can see that we get one number and you can take our word for it that this number is correct.  Now, on to the interesting part.  How does this compare to using BC or TC?  BC would simply require us to use the COUNTD() with no FIXED wrapper, whereas a TC would require us to add the Row IDs to the canvas, then aggregate them out using another TC.  

From a ease of creation standpoint, BC is by far the simplest because COUNTD() is a built-in function.  LoD comes in second because it's the same calculation wrapped in a FIXED expression.  TC comes very far in last because it requires multiple TCs wrapped around some non-sensical functions (SUM() of a MAX() to be precise).

From a flexibility standpoint, BC and LoD tie for first.  BC would allow you to add any dimensions you want to the chart, but the value will change to represent what's in the chart.  On the other hand, adding a dimension to the chart would not change the LoD.  However, you could change the way you calculate your LoD in order account additional dimensions.  Again, TC come in last because adding a new dimension to the chart would probably require you to update your Compute Using in order to allow for accurate results.

Finally, let's look at the Performance Monitor.
Total Distinct Rows (PM)
As we can see, there was no performance difference between BC and LoD, whereas TC took almost 10 seconds to load this simple calculation.  This is an easy decision

VERDICT: BC if you want it to respond to added dimensions, LoD if you don't.

Metric 2: Distinct Customers that purchased at least 1 item over $100 by Category

This question is a little more difficult.  We have to be able to identify which items cost over $100.  Fortunately, our data is already at the item level.  So, this problem comes down to using a traditional filter.  We want to start by adding [Sales] to the filter shelf.  Not SUM( [Sales] ), just [Sales].  You can achieve this by right-click-dragging [Sales] to the filter shelf, and selecting "All Values" or you can drag [Sales] to the dimensions shelf, then to the filter shelf.  The choice is yours.  This filter will remove all rows from the underlying data that have a [Sales] value less than 100.  Now, there are two ways we can approach this, using INCLUDE/EXCLUDE or using FIXED.  Let's start with INCLUDE/EXCLUDE.
Distinct Customers (INCLUDE)
Distinct Customers with an item over $100 by Segment (INCLUDE)
We saw in Part 2 of this series that INCLUDE/EXCLUDE LoDs are calculated AFTER traditional filters.  Therefore, simply making this an INCLUDE/EXCLUDE LoD with no dimension allows it to be calculated after the filter.  Let's look at the FIXED version.

Distinct Customers (FIXED)
Distinct Customers with an item over $100 by Segment (FIXED) (incorrect)
Hmmm.  There's an issue here.  The numbers aren't the same as before.  Part 2 also mentioned that FIXED LoDs are calculated BEFORE traditional filters.  This means that this LoD sees ALL Customers, not just the ones we've filtered for.  We can alleviate this by adding the Sales filter to the Context (which is created before FIXED LoDs).
Distinct Customers with an item over $100 by Segment (FIXED)
Voila.  It works.  Now, let's consider the comparison.

From an ease of creation standpoint, BC wins again due to the fact that we're using a built-in calculation.  LoD runs in a close second, but does require some extra knowledge about order of operations in Tableau.  Finally, TC is just as unwieldy here as it was before.

The story is the same as last time from a flexibility standpoint.  It all depends on what your next step is.  If you want to continually add and remove dimensions to flow through your analysis, then BC is what you want.  However, if you want these totals to stay the same while you run simultaneous analyses, then a FIXED LoD is what you need.  We can't think of any reason why you would want to use an INCLUDE/EXCLUDE LoD in this situation instead of BC, but that doesn't mean there isn't one.  Lastly, TC suffers from the same issues it did on the first metric.  Compute Using is not going to be your friend in this case.

Let's check out the Performance.
Distinct Customers with an item over $100 by Segment (PM)
There doesn't seem to any discernable difference between most of the Calculation Types.  This is actually surprising because of how poorly TC performed in the first test.  It just goes to show that TCs can be very useful when there aren't a large number of marks on the chart.

VERDICT: BC if you want it to respond to added dimensions, FIXED LoD if you don't.

Metric 3: Distinct Customers who spent at least $100 in 2013 by Segment.

While this metric may seem very similar to the previous one, there is one major difference.  Our data is not at the Year level, it's at the Order Line level.  That means that the underlying data set cannot be used to identify a Customer who spent $100 in 2013.  This means that BC is completely off the table.  Our only options now are TC and LoD.  Let's try it out.
Distinct Customers who spent at $100 (INCLUDE)
Distinct Customers who spent at least $100 in 2013 by Segment (INCLUDE)
This is when LoDs start to begin causing headaches.  We can filter the chart down to 2013 to start.  Next, we need to identify which customers spent more than $100.  Finally, we need to typecast the Boolean (T/F) values to Numeric (1/0) values.  Then, we can add those up.  An interesting thing to note about this data set is that each Customer ID exists in only 1 segment.  This means that we can use FIXED to exploit this for some extra efficiency.
Distinct Customers who spent at $100 (FIXED)
Distinct Customers who spent at least $100 in 2013 by Segment (FIXED)
Adding the filter to context, and changing the word INCLUDE to FIXED does the trick.  If the relationship between Customer ID and Segment wasn't perfectly cascading, then we would also need to have Segment in our FIXED partition.  Now, let's take a look at the comparison.

From an ease of use standpoint, they feel pretty similar.  It more comes down to what you're comfortable with.  Personally, we're more comfortable with TC because that's what we're much more experienced with.  However, the more we work with LoDs, the more see their value.

From a flexibility standpoint, we'd have to say that the INCLUDE LoD wins out.  It allows you to add any dimension to the chart you would like and the calculation will still work.  However, if you were add another dimension, Category for instance, you will get double counting because a customer is likely to have purchased from more than one Category.

Finally, let's look at the performance monitor.
Distinct Customers who spent at least $100 in 2013 by Segment (PM)
Turns out that there was no performance difference between INCLUDE and FIXED.  This is likely due to the cascading relationship between the two.  However, there was a massive performance difference (more than 3x) between TC and LoD.  Given what we've seen so far, we're willing to bet that this gap would grow even more as the number of marks gets larger.

VERDICT: LoD because of a significant performance increase.

Throughout this post, we've seen that there are multiple ways to approach each type of problem.  As far as distinct counts go, it seems that you should use COUNTD() whenever you can.  If you can't, then LoD is probably the next best option.  Thanks for reading.  We hope you found this informative.

The workbook for this post can be found here.

Brad Llewellyn
Business Intelligence Consultant
llewellyn.wb@gmail.com
http://www.linkedin.com/in/bradllewellyn

2 comments:

  1. If I wanted to add for instance a DATEPART('month',apt_dt)= 10).How would I do that.
    I've tried the following
    Sum( IF DATEPART('month',[appt_date])= 10 Then ( INT(
    { INCLUDE[customer id]:Sum ([Arrived Count]) < 2}) )end)
    I'm getting values back but it seems to be performing the fixed [customer id] instead of the include...If I was to chart this over the 12 months, month 10 shows 983 without the date and just Sum( INT(
    { INCLUDE [Customer ID]:Sum ([Arrived Count]) < 2}) ).
    when I include the datepart it goes to 288. 983 is correct. Could you give some tips on this.

    ReplyDelete
  2. Hi Brad,

    Thanks for the amazing post. Had a quick question on the metric 3 INCLUDE calculation.
    (INT({INCLUDE [Customer ID]:SUM([Sales])>=100}))

    and adding a sum prefix to it yields the same result.

    I am just a bit confused here.

    Thanks,
    Jeremiah

    ReplyDelete