Monday, January 12, 2015

Using the Field Preparation Tools in Alteryx

Today, we're going to talk about the tools with the Preparation segment of the Alteryx toolbar.  More specifically, we're going to talk about the tools that alter the fields within your data set.  To be honest, we could write an entire blog post on each tool in this category.  However, we'll keep it high level for now.  We'll be using the Pet Store Monthly Sales Data from Alteryx.

Field Preparation Tool 1: Select
This tool is one of the more useful tools in Alteryx.  It allows you to choose which fields you would like pass through to later segments.  It also allows you rename the fields, change their data types, and add descriptions.
Renaming Fields and Changing Data Types
Regardless of the type of analysis, you will be using this tool quite a bit.

Field Preparation Tool 2: Formula
This tool is the primary workhorse for any calculations you want to do.  If you want to perform any operation between fields in your data set, this is how you it.
Total Sales
Here, we simply calculated the Total Sales as the Sum of Sales for each month.  This operation could also be done using the Transpose tool that we'll talk about in a later post.  Simply put, if you want to calculate a value using values in the SAME ROW, this tool will probably do the job.  The formula does way more than arithmetic though.  It has a large number of functions for manipulating text, changing data types, and performing advanced calculations.  It's definitely worth the time to look through the functions under the "Functions" tab of the Formula Properties window.

Field Preparation Tool 3: Multi-Row Formula
Multi-Row Formula
This tool is an extension of the formula tool we just saw.  It allows you to apply a formula across MULTIPLE ROWS.  This can really good when you want to calculate running totals or differences from last year.
Duplicate Customer
Here, we used the tool to find if there are any duplicate customers in the data set.  It should be noted that this tool does not sort your data.  So, it's a very good idea to sort the data before using this tool.  Otherwise, you have no idea what Row-1 is!

This tool actually has quite a few interesting features.  One of more interesting features is the GROUP BY.  Let's say that you wanted to calculate a running sum that restarts every year.  You could GROUP BY Year and calculate [Row - 1:Running Sum] + [Sales].  You should also take note of the "Values for Rows that don't Exist" options.  These become very useful when you are using GROUP BY because if you are the first row in the GROUP BY, Row - 1 doesn't exist.

Field Preparation Tool 4: Multi-Field Formula
Multi-Field Formula
This tool is another extension of the Formula tool.  However, this apply to multiple fields instead of multiple rows.  It's exceptionally useful when you want to apply the same calculation to a number of fields, but don't want to repeat the same thing over and over again in the formula tool.
Monthly Sales by Thousands
Here, we created a simple calculation that reports the Monthly Sales by Thousands instead of by Dollars and saved them as new fields suffixed with (K).  What makes this tool really cool is that you can perform calculations using the Name and Data Type of the fields as well.  Alas, we've never found much analytical power in this tool.  If your data set is shaped properly, then there are other tools that can do most of these types of calculations.  If your data set isn't shaped properly, then you can easily use some of the other tools to reshape your data so that calculations become easier and more efficient.

Field Preparation Tool 5: Record ID
Record ID
This tool is incredibly simple, yet undeniably useful if used properly.  All it does is add another field to your data set that tells you which row you are looking at.
Record ID Properties
It's primary use comes when you need to sort your data to perform a certain calculation, but need to be able to revert to the original sorting for display purposes.  It's also useful if you are dealing with large amounts of data and need to be able to make Surrogate Keys.  Basically, a surrogate key is an integer that corresponds to a row in your data set.  It's useful when you will doing large joins because joins perform exponentially quicker on integers than they do on text.  Read this if you want to learn more about surrogate keys.

Field Preparation Tool 6: Tile
This tool adds another field to your data by tiling or grouping a numeric field into "buckets".  This process goes by a number of different names, three of which we just used.  Technically, the process is called "Discretization" and is useful when you want to reduce the number of unique values in a field.  For instance, if you are dealing with age, you probably don't want to treat 23 differently than 24.  So, you could group them into buckets of 10, i.e. 10-19, 20-29, etc.  This tool will do that for you in a slightly more mathematically sound way.
Discretization of Total Sales
Here, we created buckets for Total Sales.  The tool has a few different methodologies you can use to create your buckets.  We won't go into detail on them at this time.  The best part of this tool is that it can output Names as well, i.e. Low, Medium, High.  If you are interesting in more detail on Discretization, read this.

Field Preparation Tool 7: Multi-Field Binning
Multi-Field Binning
Just like Multi-Field Formula was an extension of Formula, Multi-Field Binning is an extension of Tile.  It simply applies the same discretization logic to multiple fields at the same time.  This tool does NOT group multiple fields into a single bucket, it simply saves the time of using multiple tile tools.
Discretizing Monthly Sales
Strangely, you lose the ability to use the more advanced algorithms like "Smart Tile" and get names for the bins.  Losing those features forces us to call this tool "Situational on a good day".

Field Preparation Tool 8: Imputation
Imputation Tool
This tool allows you to replace certain values within your data.  It is most commonly used to replace missing values. But, depending on the structure of your data, you may want to use it to replace 0, -1, or some other value you use to signify a bad or missing value.  The underlying process is called Imputation and is very well-researched in the statistical community.
Imputing Missing Values
The most common method for imputing values is by replacing them with the average or median of the rest of the values.  For instance, if the Bob's Burgers didn't report their sales to you for January, you can estimate their sales as the average of all their competitor's sales for that same month.  If you want to get more technical, you can replace those sales with Bob's Burgers sales from December or estimate them as the average of December and February.  Unfortunately, you'd need to use a different tool, such as Multi-Row Formula, to handle those issues.  For more information on Imputation, read this.

Field Preparation Tool 9: Auto-Field
Despite the somewhat ambiguous name, this tool is used for resizing text fields.  You give the tool a list of text or numeric fields and it reduces them to the smallest possible size to fit all of the records.  For instance, if you had state names in a field, the auto-field would reduce the field size to 14 in order to fit North Carolina and South Carolina.
Resizing Text Fields
The interface is pretty slim for this one because there's not much user input.  This tool is pretty useful in some situations.  We once had to deal with an extremely large variable-width text file.  In order to accomodate the completely unknown text sizes, we set all the fields to size 255 and resized them later using Auto-Field

This was a great showcase of the Field Preparation Tools in Alteryx.  We saw that you can accomplish most of the computational tasks using these tools.  Stay tuned for the next post where we'll be talking about the Row Preparation Tools.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit