Monday, June 2, 2014

Data Mining in Excel Part 5: Sampling from your Data

Today, we're going to talk about the last component of the Data Preparation Segment in the Data Mining Add-in for Excel, Sample Data.  Some of us have LOTS of data.  If you have a table with 1,000,000 rows of data and 50 columns, you're asking a lot from Excel when you try to run some analysis on that data.  So, a common practice is to randomly choose a handful of those rows, let's say 10,000, to do your analysis on.  The keyword in that sentence is "random".  If you choose the rows yourself, you may introduce some sort of unknown bias into your sample, which could heavily influence the results of your analysis.  So, we're going to let the tool do it for us.  As usual, we will be using the Data Mining sample data set from Microsoft.
Sample Data
The first step of any analysis is to select the source of data.
Select Source Data
This component is the first time we've seen the "External Data Source" option.  This is really cool.  Imagine you work for a giant bank that has a fact table for all of the transactions for all of your customers.  This fact table could easily be billions of rows.  That's way too much for Excel or Analysis Services to handle without some very powerful hardware.  Have no fear!  The Data Mining add-in will let you pick a few thousand rows from that table and play with them right in Excel.  Better yet, it will RANDOMLY pull those rows.  Also, you can right your own query here as well.  So, if you want to RANDOMLY select transactions from this year at all ATMs in the Southeast region, you can.  However, in our example, we only have 1000 rows of data.  Let's move on.
Select Sample Type
There are two types of sampling available here, Random Sampling and Oversampling.  Random sampling is the more common type.  It works just like flipping a coin or rolling a die.  No attention is paid to the data, it is only picked.  On the other hand, Oversampling is used to ensure that rare conditions are met.  For instance, if you want to run a test that compares millionaires to non-millionaires, then most random samples wouldn't have any millionaires in them at all.  It's difficult to test for something that doesn't occur in your data.  Back to our millionaires example, if you were to oversample your data, you could guarantee that at least 10% of your sample includes millionaires.  This does cause issues for some types of analysis, so only use it when you absolutely need to.  It's a very common practice in Fraud detection, where analysts are routinely looking for one transaction out of millions.  For now, let's just stick with a random sample.
Random Sampling Options
We're given two options.  We can either sample a fixed percentage of our data, or a fixed number of rows.  Let's stick with the default, 70%.
Random Sampling Finish
Finally, we get to name the sheet where the results will be stored.  Oddly enough, you can also choose to store the unselected data.  We can't think of any reason as to why you would want this data.  We will say that if you want to run repeated tests on samples, then you need to sample each time.  You can't create two random samples simply by using the selected and  unselected data from your first sample.  Now, let's go back and see what happens when we try Oversampling.
Oversampling Options
Here, we can decide which value we want to see more of.  We can also decide what percentage of our sample will have that value.  In fact, that percentage is guaranteed.  Consider this example.  You have a data set with 13 rows, 10 of which have yes and 3 of which have no.  If you oversample the data with a 30% chance of no and a sample of size 6, the algorithm will randomly select 2 of the 3 no values, and 4 of the 10 yes values.  However, if you increase the sample size to 13 (the entire data set), the algorithm will realize that there are only 3 no values, and restrict the sample size to 10.  This means that you will always have at least your requested percentage of a particular value.

These techniques are well respected in the scientific community and can be really useful when you want to explore and analyze really large data sets.  Stay tuned for the next post where we'll be talking about the "Analyze Key Influencers" component.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Analytics Consultant
Mariner, LLC

No comments:

Post a Comment