Monday, March 10, 2014

Predictive Analytics in Tableau Part 11: How Data Moves Between R and Tableau

Today, we will end this series on an exploratory note.  We've wondered throughout this series precisely how data travels between R and Tableau and what precautions need to be taken to ensure that we are getting accurate results.  So, we've designed a sample data set that will allow us to test a couple of scenarios and develop solutions for solving this problem.  Let's take a look at our data.
Data
First, we have a complete data set with years ranging from 2000 to 2014.  We also have five different numeric fields.  The first is ordered the same as the years.  The second is ordered in reversed/descending order.  The third is simply random.  The fourth is the same as the first, but with the years 2003 and 2008 missing.  The fifth is the same as the random one, but with the years 2003 and 2008 missing.  Now, let's start by passing ascending data and seeing if we get it back in the same order.
Ascending (No Sort)
Ascending (No Sort) by Year
As expected, R returned the data back exactly like it should have.  Now, let's try it with the random data to see if the values matter (they shouldn't).
Random (No Sort)
Random (No Sort) by Year
Again, no issues.  Finally, let's see if the missing values cause any problems.
Missing (No Sort)
Missing (No Sort) by Year
We finally have an interesting result.  Apparently, the SCRIPT_REAL function can't return NULL values.  R refers to these null values as NA.  This leads us to an interesting dilemma.  Is Tableau passing a Null and receiving a 0, or is Tableau passing a 0 initially.  We can test this.  First, let's show a simple example.
NA + 1
This picture shows that R is just like most languages in that you can't perform arithmetic on Null values.  So, let's use this to test what value R is receiving.
Missing (+ 1)
If R is receiving NAs, then this function should return NA, which Tableau would read as 0.  However, if Tableau is passing 0, then this function should return 1.  Let's see what happens.
Missing (+ 1) by Year
We see that Tableau is in fact passing 0 instead of Null.  This seems awkward to us.  Before we jump to any conclusions, let's confirm using a different method.
is.na
R has this nifty function called is.na() that returns true if the value is null.  So, let's run this test using Tableau.
Missing (is.na)
Missing (is.na) by Year
We have confirmed that Tableau is not passing Null values to R.  We don't see any reason why Tableau couldn't just pass NA in place of the Nulls.  Perhaps it has something to do with Rserve?  If you know anything, please let us know in the comments.  This leads us to another issue.  How do we combat this?  We have an idea, but let's see how R handles it.
Coercing NA
This is great!  It seems that if we pass the values to R as strings, then anything that isn't a number should be turned into NA.  This leads us to another question.  What does Tableau pass in place of a Null for Strings?
Missing (String)
Missing (String) by Year
We see that Tableau coerces Null values into the string value "Null".  This is great.  Now, let's see what happens when pass it to R and coerce it back to a number.
Missing (String Passed)
Missing (String Passed) by Year
Success!  We now know that if you have missing data, you can fix the problem by coercing them to strings before you pass them to R, then coercing them back to numbers in R.  However, this is a VERY serious issue.  It's shocking to us that a tool like Tableau would lack such simple functionality as Null handling.  We've put a post out of the Tableau forums to see if other people see it the same way.  You can see the post here.

EDIT:  Apparently this issue has been fixed in Tableau 8.1.3.

Now, let's move on to ordering.  What would happen if we passed the descending values to R, then sorted them and passed them back to Tableau?
Descending (Ascending Sort)
Descending (Ascending Sort) by Year
As you can see, Tableau was completely unaware that the data was sorted and sent it back in the order it was received.  This is a scary, yet unsurprising, result.  There are ways to handle this though.  Let's propose a real scenario for this.  Imagine that you want to create a time series that predicts the next value.  As we've seen in the last few posts, the data must be ordered in ascending order if you want to create a time series.  So, what would happen if we tried to pass in an unordered time series?
Unordered Time Series (Year)
Unordered Time Series (Value)
Unordered Time Series (Table)
As you can see, the time series gets returned in ascending order, regardless of the order in which it was received.  You might ask "Who would look at a time series out of order like you have in this table?"  That's where the interesting idea comes in.  Consider the traditional way you would look at time series data.
Time Series (Ordered)
How do you know what order the data is in?  Can you assume that the data is in ascending order?  Well, if we overlay Unordered Time Series (Year), we will see if any of the years don't match up.
Time Series (Ordered) with Labels
Following the labels, we see that everything is in order.  Is this always the case?  Our data was originally in ascending order by Year.  What if we were to reorder it?  When Tableau creates an extract, it automatically sorts the data.  We're not sure of the precise algorithm, but we do know that it was sorting our data by date.  So, we added another date in a different order in an attempt to trick Tableau into ordering the data improperly.
Data (Disordered)
Now, will our line chart still be in the correct order?
Time Series (Attempted Disordered)
As you can see, it doesn't matter whether the data is ordered or not.  These line charts always seem to be ordered by Date.  This is a very good thing.  It adds robustness to our procedures.  Now that we know these charts are always ordered, we can actually prevent a possible catastrophe.
Null Sorted
Notice how the NA gets removed when we sort?  If we were to sort data in R when we had missing data, we would be removing our missing data, which would have huge repercussions.  Namely, you can pass ten values to R and receive nine in return.  Tableau will toss an error if this happens.  Fortunately, this isn't a problem for our situations.  However, keep this in mind if you do find yourself in a situation where you are sorting data that may be missing.

In short, Tableau expects to receive data in the order in which it passes it.  Therefore, if you need to order your data, order it in Tableau BEFORE you pass it to R.  Tableau also doesn't pass Null numeric values to R, it passes zeros.  If you want to pass numeric data that may be missing, you may need to coerce the data to strings first.  Thanks for reading.  We hope you found this informative.

P.S.  Zen Master Jonathan Drummey has done a similar analysis and it can be found here.

Brad Llewellyn
Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn

1 comment:

  1. Hi Brad,

    Is there any way I can show only the forecast value in the chart and hide actual values?

    Thanks

    ReplyDelete