Monday, July 7, 2014

Data Mining in Excel Part 10: Highlight Exceptions

Today, we will be talking about one of the coolest, yet least common algorithms found in the Table Analysis Tools, Highlight Exceptions.
Highlight Exceptions
Imagine that you're a bank with a large transaction table and you want to detect fraudulent transactions.  One of the methods for doing this is to determine which transactions aren't like the others.  A mother typically withdraws $500 each wednesday and last friday she withdrew $2000?  Possible Fraud.  A frugal Dad typically makes small purchases on most days of the week and last weekend he spent over $3000?  Possible Fraud.  The question becomes "How do you find these occurrences ?"  This problem is precisely what the Highlight Exceptions tool is for.  Let's see how it works.  As usual, we will be using the Data Mining Sample data set from Microsoft.
Column Selection
Of course, we don't want to use the ID field.  Next, let's see what it does.
Exception Table
Here, we get to see which variables have the most "extreme" or "unusual" observations.  This is a somewhat complex process.  The algorithm doesn't simply look for extreme values in each column.  It finds unusual rows in the data and determines which column is likely making that row unusual.  In this case, the question becomes "Which attribute of this person is making him or her unusual?"  We also have the option of adjusting the threshold.  If we set the threshold to 100, we get no rows.
100 Threshold
Conversely, if we set the threshold to 0, we get every row.
0 Threshold
So, we have to choose a good middle ground.  The algorithm defaults to 75, which is good enough for now.  If you find that there are too many credible rows that the algorithm is selecting, the perhaps you should increase the threshold.  If you find that too many unusual rows are being missed, then decrease the threshold.  However, the cool part is still to come.  Let's go back to our original data set.
At first glance, there doesn't appear to be anything different about the rows.  But, if we filter by color, we see that all of the rows flagged by the algorithm are colored Yellow.
Moreover, the value which is likely causing the problem is highlighted in a different shade of Yellow.  Let's look at the first row, a male professional with partial college education, lots of cars, lots of kids, and makes $130,000 per year who works from home.  That's very unusual.  Perhaps there was a clerical error or something.  This could even be a real, yet rare person.  In fact, we actually know a person who fits criteria very similar to this.  Now, let's see which observations would drop off if we increased the threshold.
Less Exceptions
It turns out that the example we chose wasn't significant enough to meet the threshold of 80.  As you an see, this algorithm is extraordinarily useful and very easy to use.  On a negative note, we wish that the algorithm would append a column onto the data telling which threshold it falls out at.  For instance, the row used as an example would have a threshold somewhere between 75 and 80.  If the data had a column for this value, we could do some really interesting analysis with it.  However, since the data isn't altered in any way, we can't use this within Tableau to make any more cool charts.

On the other hand, there's nothing stopping you from combining the algorithms.  If you were to use the "Detect Categories" tool before using "Highlight Exceptions", you might find that some exceptions change thresholds.  What's unusual in relation to the whole data set, might not be as unusual when compared to other rows in the same category.  Check out our next post where we'll be looking at Goal Seek.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

No comments:

Post a Comment