Monday, November 18, 2013

Tableau vs. Power Pivot Part 10: Introductory Data Cleansing

Today, we will talk about Data Cleansing.  In short, data cleansing is the process by which you take raw data from a source and prepare it for end-user consumption.  Source data can come in all sorts of unusual formats.  The first step of the analytical process is to make sure that your data is "clean",  as it is being called.  As usual, we will use the Superstore Sales sample data set in Tableau, as well as a table from Wikipedia.

Category 1: Inconsistent Text Labels

This is an extremely common problem when combining data from disparate sources or sources with poorly regulated user entry.  In this example, we have taken the State names in our data set and changed some of them to the 2 letter abbreviation.  For example, some rows will say California, while other rows will say CA.  We want all of these rows to be categorized under the same state.  How do we do this?  First, let's see what the data looks like.
Sales by State (Dirty)
As you can see, we need some way to collapse these varying state names into the proper categories.  In order to do this, we need to create a table that maps CA and California to California, FL and Florida to Florida, and so on.
State Dimension
Now, let's see what Power Pivot can do with this.
Joining States to Orders (Power Pivot)
Technically, this is a dimension because it links directly to our Orders Table.  However, in a true Star Schema, this table would link to the State Dimension.  The process of adding dimension tables onto dimension tables is called Snowflaking.  Finally, let's check out the results.
Sales by State (Clean) (Power Pivot)
We simply use the clean states instead of the dirty ones, and Power Pivot takes care of the rest.  As you can see, there are no abbreviated states in the table on the right.  Now, let's try this in Tableau.  There are two ways we could do this.  We could use Tableau's Drag-and-Drop Joining feature or we could use Data Blending.  Since we used blending in a previous post, we will use joining here.
Excel Connection (Multiple Tables) (Tableau)
First, we need to define our Primary table, which is usually our fact table.
Add Table (Tableau)
Then, we link the primary table, Orders, to our secondary table, States.  Note that no coding was done here.  We simply selected options in the menus and clicked "Add."  Finally, let's see our results.
Sales by State (Tableau)
Just like Power Pivot, we see that we have no abbreviated state names in our list.  This was a pretty easy comparison because these tools used almost precisely the same technique.  There was only one major difference.  The process in Tableau uses upfront processing power to join the data sets, whereas Power Pivot must calculate that join every time we use that field.  However, we have no evidence to say that this process costs any significant processing power.  If this were done on an extraordinarily large scale, would it cause a serious difference is processing time?  We do not know for sure.  Alas, we cannot sway our decision based off of pure speculation.  Therefore, we are forced to call this a tie.

EDIT: Zen Master Jonathan Drummey pointed out that you can achieve similar results using "Aliases" within Tableau.  In fact, we'd recommend using this method instead of the join for a couple of reasons.  First, it doesn't require the data to be in the same source, which is required for joining.  Second, this way allows you to maintain the groupings within the Tableau workbook itself, instead of an external Excel file.  Just goes to show that with these versatile tools, there's always another option you can try.

Winner: Tie

Category 2: Pivoting

We often see data that is in a reporting style tabular format with measures across the columns.  However, there are some times when you think that data would be better represented as a single measure with an added dimension.  This is called pivoting.  Let's look at the following table headings from a data set from Wikipedia.

Crime Table Headings
As you can see, there are 7 different types of crimes listed here.  However, in it's current format, how would you aggregate certain types of crimes?  For instance, how would you get the total number of crimes?  You would have to individually type the names of every single column, then sum that total.  Heaven forbid you try to do something complex such as a standard deviation of all crimes.  That would be a nightmare to create.

So, what we really want to do is pivot these crimes into two columns, Crime and Cases / 100k, which is the unit of measure of these columns.  We could possibly do this with Excel.  However, it would be a somewhat complicated task.  Therefore, we are going to introduce another free add-in for Excel 2013 called Power Query.  Power Query specializes in data transformation and manipulation.

You might say "Why are you talking about Power Query in a Tableau vs. Power Pivot examination?"  Power Query is part of Microsoft's Power BI set, which also includes Power Pivot, Power View, and Power Map.  It's also a free tool for use with Excel 2013 and is great for handling these types of issues.  Now, let's see how we would do this.
Unpivoting Crimes
We simply select all the crime columns and right-click, then select "Unpivot Columns."  Power Query does the rest of the work for us.  After a little more cleaning, such as changing column names and types, we have the following:
Crime (Clean)
If you look on the right side of the window, you can see a list of steps we took.  We can backtrack any number of steps if we wanted to.  We could also save this query and run it again at a later time.  Now, this data set is now ready for use in our tools.  When we save this table, it will be saved as a basic Excel table.

There is a similar add-in for Excel that comes from the Tableau community.  It is commonly known as the Tableau Data Shaping Tool.  It's basic premise is to take a "Report-Style" data set and turn it into a "Table-Style" data set.  It has three basic features.  First, it can pivot.
Pivot Data (Tableau Data Shaping Tool)
It can also copy data from a pivot table, with no formatting, which can also be done via Paste Values.  Lastly, it can fill down, which can be immensely helpful
Fill Down (Tableau Data Shaping Tool)
As you can see, this tool is useful, yet very rudimentary in its design.  It is no match for the shaping power of Power Query.  But, since neither of these tools play any significant amount of favoritism towards Tableau or Power Pivot, we cannot declare a winner.  We recommend downloading both of these add-ins because they can be extremely useful at times.

Conclusion

For some of the basic data cleansing procedures, such as typecasting and adding snowflaked dimensions (The procedure in the first section), Power Pivot and Tableau are about equal.  However, once your needs get more advanced, it is advised that you look for a more powerful tool, such as Power Query.  It is important to note that Power Query publishes to Excel 2013.  Therefore, we are limited by the 1 million row limit.  If you need to surpass the 1 million row limit, you will need to look elsewhere.  Thanks for reading.  We hope you found this informative.

EDIT: Chris Webb pointed out that Power Query can publish straight to the Power Pivot data model.  This means it can bypass Excel's 1 million row limit if you need.  This is just the tip of the iceberg for Power Query.  It can do so much more!

Brad Llewellyn
Associate Data Analytics Consultant
Mariner, LLC
brad.llewellyn@mariner-usa.com
http://www.linkedin.com/in/bradllewellyn
http://breaking-bi.blogspot.com

5 comments:

  1. Hi Brad,

    Thanks for continuing this series! Tableau has a helpful third option for cleansing data, using Aliases. Since they are not changing the underlying data, they can be really useful for situations where we need the 'messy' list of values but want to display a clean list for users. Also, for dimensions with a small number of values such as booleans, aliases will be faster to assign than any of the other options.

    ReplyDelete
    Replies
    1. You are correct sir! That feature slipped my mind. Thanks for sharing.

      Delete
  2. Power Query has the option to publish straight to the Excel Data Model (and this is even easier in the latest version), so there is no 1 million row limit.

    ReplyDelete
  3. Very impressive article. Thanks for the writing.

    ReplyDelete