Category 1: Inconsistent Text Labels
This is an extremely common problem when combining data from disparate sources or sources with poorly regulated user entry. In this example, we have taken the State names in our data set and changed some of them to the 2 letter abbreviation. For example, some rows will say California, while other rows will say CA. We want all of these rows to be categorized under the same state. How do we do this? First, let's see what the data looks like.
Sales by State (Dirty) |
State Dimension |
Joining States to Orders (Power Pivot) |
Sales by State (Clean) (Power Pivot) |
Excel Connection (Multiple Tables) (Tableau) |
Add Table (Tableau) |
Sales by State (Tableau) |
EDIT: Zen Master Jonathan Drummey pointed out that you can achieve similar results using "Aliases" within Tableau. In fact, we'd recommend using this method instead of the join for a couple of reasons. First, it doesn't require the data to be in the same source, which is required for joining. Second, this way allows you to maintain the groupings within the Tableau workbook itself, instead of an external Excel file. Just goes to show that with these versatile tools, there's always another option you can try.
Winner: Tie
Category 2: Pivoting
We often see data that is in a reporting style tabular format with measures across the columns. However, there are some times when you think that data would be better represented as a single measure with an added dimension. This is called pivoting. Let's look at the following table headings from a data set from Wikipedia.
As you can see, there are 7 different types of crimes listed here. However, in it's current format, how would you aggregate certain types of crimes? For instance, how would you get the total number of crimes? You would have to individually type the names of every single column, then sum that total. Heaven forbid you try to do something complex such as a standard deviation of all crimes. That would be a nightmare to create.
So, what we really want to do is pivot these crimes into two columns, Crime and Cases / 100k, which is the unit of measure of these columns. We could possibly do this with Excel. However, it would be a somewhat complicated task. Therefore, we are going to introduce another free add-in for Excel 2013 called Power Query. Power Query specializes in data transformation and manipulation.
You might say "Why are you talking about Power Query in a Tableau vs. Power Pivot examination?" Power Query is part of Microsoft's Power BI set, which also includes Power Pivot, Power View, and Power Map. It's also a free tool for use with Excel 2013 and is great for handling these types of issues. Now, let's see how we would do this.
We simply select all the crime columns and right-click, then select "Unpivot Columns." Power Query does the rest of the work for us. After a little more cleaning, such as changing column names and types, we have the following:
Category 2: Pivoting
We often see data that is in a reporting style tabular format with measures across the columns. However, there are some times when you think that data would be better represented as a single measure with an added dimension. This is called pivoting. Let's look at the following table headings from a data set from Wikipedia.
Crime Table Headings |
So, what we really want to do is pivot these crimes into two columns, Crime and Cases / 100k, which is the unit of measure of these columns. We could possibly do this with Excel. However, it would be a somewhat complicated task. Therefore, we are going to introduce another free add-in for Excel 2013 called Power Query. Power Query specializes in data transformation and manipulation.
You might say "Why are you talking about Power Query in a Tableau vs. Power Pivot examination?" Power Query is part of Microsoft's Power BI set, which also includes Power Pivot, Power View, and Power Map. It's also a free tool for use with Excel 2013 and is great for handling these types of issues. Now, let's see how we would do this.
Unpivoting Crimes |
Crime (Clean) |
There is a similar add-in for Excel that comes from the Tableau community. It is commonly known as the Tableau Data Shaping Tool. It's basic premise is to take a "Report-Style" data set and turn it into a "Table-Style" data set. It has three basic features. First, it can pivot.
Pivot Data (Tableau Data Shaping Tool) |
Fill Down (Tableau Data Shaping Tool) |
As you can see, this tool is useful, yet very rudimentary in its design. It is no match for the shaping power of Power Query. But, since neither of these tools play any significant amount of favoritism towards Tableau or Power Pivot, we cannot declare a winner. We recommend downloading both of these add-ins because they can be extremely useful at times.
Conclusion
For some of the basic data cleansing procedures, such as typecasting and adding snowflaked dimensions (The procedure in the first section), Power Pivot and Tableau are about equal. However, once your needs get more advanced, it is advised that you look for a more powerful tool, such as Power Query. It is important to note that Power Query publishes to Excel 2013. Therefore, we are limited by the 1 million row limit. If you need to surpass the 1 million row limit, you will need to look elsewhere. Thanks for reading. We hope you found this informative.
EDIT: Chris Webb pointed out that Power Query can publish straight to the Power Pivot data model. This means it can bypass Excel's 1 million row limit if you need. This is just the tip of the iceberg for Power Query. It can do so much more!
EDIT: Chris Webb pointed out that Power Query can publish straight to the Power Pivot data model. This means it can bypass Excel's 1 million row limit if you need. This is just the tip of the iceberg for Power Query. It can do so much more!
Brad Llewellyn
Associate Data Analytics Consultant
Mariner, LLC
brad.llewellyn@mariner-usa.com
http://www.linkedin.com/in/bradllewellyn
http://breaking-bi.blogspot.com
http://breaking-bi.blogspot.com
Hi Brad,
ReplyDeleteThanks for continuing this series! Tableau has a helpful third option for cleansing data, using Aliases. Since they are not changing the underlying data, they can be really useful for situations where we need the 'messy' list of values but want to display a clean list for users. Also, for dimensions with a small number of values such as booleans, aliases will be faster to assign than any of the other options.
You are correct sir! That feature slipped my mind. Thanks for sharing.
DeletePower Query has the option to publish straight to the Excel Data Model (and this is even easier in the latest version), so there is no 1 million row limit.
ReplyDeleteGood catch! I'll be sure to edit that in.
DeleteVery impressive article. Thanks for the writing.
ReplyDelete