Monday, August 25, 2014

Data Mining in Excel Part 17: Estimation

Today, we'll be talking about the next tool in the line of the Data Mining Tools, Estimate.
In the previous posts, Part 15 and Part 16, we used decision trees to predict a Yes/No variable.  Estimate uses a similar algorithm to predict numeric values.  The algorithm itself is a variation of linear regression.  Linear regression is arguably the simplest and most widely-used technique for estimating numeric values.  If you want to learn more about linear regression, read this.  Let's jump into it and we'll discuss more along the way.  As usual, we'll be using the Data Mining Sample data set from Microsoft.
Select Data Source
The first step is always to select a data source.  As we've previously discussed, you can also connect to a SQL Database to query your data source directly if you choose.  Let's move on.
Column Selection
Here's where things get interesting.  First, we must choose which variable we want to predict.  Then, the algorithm will automatically select the columns it thinks will create the best predictions.  It does this using the same technique we used in Part 6.  There are a number of reasons why you would not include a certain variable in your model.  For instance, if you knew that a variable was extremely unrealiable, you wouldn't want to use it.  In other cases, there are political reasons you can't use a variable, such as Race and Gender.  However, we don't see any reason you wouldn't include all of the rest of the variables regardless of what the algorithm recommends.  If the variable is a bad predictor, the algorithm will throw it away anyway.  You might as well give it a chance to really test it.  Alas, we digress.

In a combination decision tree/regression model like this, there are two types of predictors, Regressors and Categories.  As a simple example, let's try to predict Income using Age and Region.  The model itself looks like

Income = C1 * V1 + C2 * V2 + ...

where the C's are fixed values determined by the model and the V's are variables that come from the data.  When you set a value as a Regressor, Age in our example, then Age is thrown directly in as a variable, like so

Income = C1 * Age + C2 * V2 + ...

However, Region isn't a numeric variable at all, how would you put it in this formula at all?  There is a technique called Dummy Variables.  You can read about it here.  Basically, you create a variable for each value that the variable can take.  For instance, Region can be "North America", "Europe", and "Pacific".  So, you would create a 1/0 variable for each value of Region.  Mathematically, you can actually leave one of the variables out entirely and it won't matter, but we won't go that far.  Read the article if you're truly interested.  So, our model would become

Income = C1 * Age + C2 * [Region = North America] + C3 * [Region = Europe] + C4 * [Region = Pacific]

In summary, any value that you want treated as a number should be a regressor and any value you want treated as a category should just be an input.  However, this algorithm in particular looks at things in a slightly different way.  The Categories are used to create the nodes in the tree and the Regressors are used in the regression equation.  You'll see what we mean in a minute.  Let's look at our columns again.
Column Selection
As you can see, we set all of the numeric variables as regressors and the rest of the variables as inputs.  As usual, we don't include the ID or the value we're trying to predict.  Let's check out the parameters.
Remember how we said this is the same algorithm we used for Decision Trees?  Well that means that is also has the same parameters.  They can be very useful if you're trying to tweak the model to fit your needs.  We won't mess with it for now.  Let's keep moving.
Training Set Selection
As with all of the Data Mining Tools, we need to specify the size of the Training set.  We'll keep the default 70%.
Model Creation
Finally, we have to create a structure and a model to store our work.  These are housed directly on the Analysis Services server and can be recalled later if you wish.  Let's check out the results.
We get a pretty big tree on our first try.  If you're not familiar with the tree diagram, you can check out Part 15.  There are a couple of differences here.  Let's highlight them.

Just like the previous Decision Trees, each node can branch into further nodes.  However, notice the bars that we've highlighted in red?  Each node represents a chunk of our data set.  This diamond represents the distribution of the Incomes within the node.  The further to the left or right the diamond is tells us where the average Income falls within this chunk.  We see that all of these nodes have low incomes because they are far to the left.  The wider the diamond is, the more spread out the incomes are, meaning that our predictions are likely to be less accurate.  Despite being low incomes, these diamonds are all very small, which means we can be confident in the predictions.  Let's look at another piece of diagram.
Remember the C's and V's we saw earlier in our formula.  The C's are the coefficients and are represented by the values in brown.  The V's are the variables and are represented by the values in red.  Combining these two allows us to create the formula that you can see at the bottom of the screen.  We won't go into explaining this formula more than we already have.  It's interesting to note that each node gets its own formula.  Just remember this, the categories create the nodes and the regressors create the formula.  Let's move on.
Dependency Network
This algorithm has Dependency Networks just like the last one.  If you're unfamiliar with Dependency Networks, check out Part 16.

By now, you should be seeing some use for these algorithms.  Even if you're not, there's still much more to come.  You might be asking "A pretty picture is great; but, how do I get the predictions into the data?"  Unfortunately, that's a more involved task for another post.  We'll restate it again for everyone to hear, "If you want the simple things done easily, check out the Table Analysis tools.  But, if you want to do the cool stuff and really push Data Mining to its limits, you need to learn to use the Data Mining tools."  Keep an eye out for our next post, where we'll be discussing Clustering.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

No comments:

Post a Comment