Metric 1: Total Distinct Rows
Let's start at the top level. What if you wanted to know the Total Distinct Count of a field in your data set?
Distinct Rows (LoD) |
Total Distinct Rows (LoD) |
We can see that we get one number and you can take our word for it that this number is correct. Now, on to the interesting part. How does this compare to using BC or TC? BC would simply require us to use the COUNTD() with no FIXED wrapper, whereas a TC would require us to add the Row IDs to the canvas, then aggregate them out using another TC.
From a ease of creation standpoint, BC is by far the simplest because COUNTD() is a built-in function. LoD comes in second because it's the same calculation wrapped in a FIXED expression. TC comes very far in last because it requires multiple TCs wrapped around some non-sensical functions (SUM() of a MAX() to be precise).
From a flexibility standpoint, BC and LoD tie for first. BC would allow you to add any dimensions you want to the chart, but the value will change to represent what's in the chart. On the other hand, adding a dimension to the chart would not change the LoD. However, you could change the way you calculate your LoD in order account additional dimensions. Again, TC come in last because adding a new dimension to the chart would probably require you to update your Compute Using in order to allow for accurate results.
Finally, let's look at the Performance Monitor.
As we can see, there was no performance difference between BC and LoD, whereas TC took almost 10 seconds to load this simple calculation. This is an easy decision
VERDICT: BC if you want it to respond to added dimensions, LoD if you don't.
Metric 2: Distinct Customers that purchased at least 1 item over $100 by Category
This question is a little more difficult. We have to be able to identify which items cost over $100. Fortunately, our data is already at the item level. So, this problem comes down to using a traditional filter. We want to start by adding [Sales] to the filter shelf. Not SUM( [Sales] ), just [Sales]. You can achieve this by right-click-dragging [Sales] to the filter shelf, and selecting "All Values" or you can drag [Sales] to the dimensions shelf, then to the filter shelf. The choice is yours. This filter will remove all rows from the underlying data that have a [Sales] value less than 100. Now, there are two ways we can approach this, using INCLUDE/EXCLUDE or using FIXED. Let's start with INCLUDE/EXCLUDE.
We saw in Part 2 of this series that INCLUDE/EXCLUDE LoDs are calculated AFTER traditional filters. Therefore, simply making this an INCLUDE/EXCLUDE LoD with no dimension allows it to be calculated after the filter. Let's look at the FIXED version.
From a flexibility standpoint, BC and LoD tie for first. BC would allow you to add any dimensions you want to the chart, but the value will change to represent what's in the chart. On the other hand, adding a dimension to the chart would not change the LoD. However, you could change the way you calculate your LoD in order account additional dimensions. Again, TC come in last because adding a new dimension to the chart would probably require you to update your Compute Using in order to allow for accurate results.
Finally, let's look at the Performance Monitor.
Total Distinct Rows (PM) |
VERDICT: BC if you want it to respond to added dimensions, LoD if you don't.
Metric 2: Distinct Customers that purchased at least 1 item over $100 by Category
This question is a little more difficult. We have to be able to identify which items cost over $100. Fortunately, our data is already at the item level. So, this problem comes down to using a traditional filter. We want to start by adding [Sales] to the filter shelf. Not SUM( [Sales] ), just [Sales]. You can achieve this by right-click-dragging [Sales] to the filter shelf, and selecting "All Values" or you can drag [Sales] to the dimensions shelf, then to the filter shelf. The choice is yours. This filter will remove all rows from the underlying data that have a [Sales] value less than 100. Now, there are two ways we can approach this, using INCLUDE/EXCLUDE or using FIXED. Let's start with INCLUDE/EXCLUDE.
Distinct Customers (INCLUDE) |
Distinct Customers with an item over $100 by Segment (INCLUDE) |
Distinct Customers (FIXED) |
Distinct Customers with an item over $100 by Segment (FIXED) (incorrect) |
Hmmm. There's an issue here. The numbers aren't the same as before. Part 2 also mentioned that FIXED LoDs are calculated BEFORE traditional filters. This means that this LoD sees ALL Customers, not just the ones we've filtered for. We can alleviate this by adding the Sales filter to the Context (which is created before FIXED LoDs).
Distinct Customers with an item over $100 by Segment (FIXED) |
Voila. It works. Now, let's consider the comparison.
From an ease of creation standpoint, BC wins again due to the fact that we're using a built-in calculation. LoD runs in a close second, but does require some extra knowledge about order of operations in Tableau. Finally, TC is just as unwieldy here as it was before.
The story is the same as last time from a flexibility standpoint. It all depends on what your next step is. If you want to continually add and remove dimensions to flow through your analysis, then BC is what you want. However, if you want these totals to stay the same while you run simultaneous analyses, then a FIXED LoD is what you need. We can't think of any reason why you would want to use an INCLUDE/EXCLUDE LoD in this situation instead of BC, but that doesn't mean there isn't one. Lastly, TC suffers from the same issues it did on the first metric. Compute Using is not going to be your friend in this case.
Let's check out the Performance.
Distinct Customers with an item over $100 by Segment (PM) |
VERDICT: BC if you want it to respond to added dimensions, FIXED LoD if you don't.
Metric 3: Distinct Customers who spent at least $100 in 2013 by Segment.
While this metric may seem very similar to the previous one, there is one major difference. Our data is not at the Year level, it's at the Order Line level. That means that the underlying data set cannot be used to identify a Customer who spent $100 in 2013. This means that BC is completely off the table. Our only options now are TC and LoD. Let's try it out.
Distinct Customers who spent at $100 (INCLUDE) |
Distinct Customers who spent at least $100 in 2013 by Segment (INCLUDE) |
Distinct Customers who spent at $100 (FIXED) |
Distinct Customers who spent at least $100 in 2013 by Segment (FIXED) |
From an ease of use standpoint, they feel pretty similar. It more comes down to what you're comfortable with. Personally, we're more comfortable with TC because that's what we're much more experienced with. However, the more we work with LoDs, the more see their value.
From a flexibility standpoint, we'd have to say that the INCLUDE LoD wins out. It allows you to add any dimension to the chart you would like and the calculation will still work. However, if you were add another dimension, Category for instance, you will get double counting because a customer is likely to have purchased from more than one Category.
Finally, let's look at the performance monitor.
Distinct Customers who spent at least $100 in 2013 by Segment (PM) |
VERDICT: LoD because of a significant performance increase.
Throughout this post, we've seen that there are multiple ways to approach each type of problem. As far as distinct counts go, it seems that you should use COUNTD() whenever you can. If you can't, then LoD is probably the next best option. Thanks for reading. We hope you found this informative.
The workbook for this post can be found here.
Brad Llewellyn
Business Intelligence Consultant
llewellyn.wb@gmail.com
Business Intelligence Consultant
llewellyn.wb@gmail.com
http://www.linkedin.com/in/bradllewellyn
If I wanted to add for instance a DATEPART('month',apt_dt)= 10).How would I do that.
ReplyDeleteI've tried the following
Sum( IF DATEPART('month',[appt_date])= 10 Then ( INT(
{ INCLUDE[customer id]:Sum ([Arrived Count]) < 2}) )end)
I'm getting values back but it seems to be performing the fixed [customer id] instead of the include...If I was to chart this over the 12 months, month 10 shows 983 without the date and just Sum( INT(
{ INCLUDE [Customer ID]:Sum ([Arrived Count]) < 2}) ).
when I include the datepart it goes to 288. 983 is correct. Could you give some tips on this.
Hi Brad,
ReplyDeleteThanks for the amazing post. Had a quick question on the metric 3 INCLUDE calculation.
(INT({INCLUDE [Customer ID]:SUM([Sales])>=100}))
and adding a sum prefix to it yields the same result.
I am just a bit confused here.
Thanks,
Jeremiah