Monday, August 25, 2014

Data Mining in Excel Part 17: Estimation

Today, we'll be talking about the next tool in the line of the Data Mining Tools, Estimate.
In the previous posts, Part 15 and Part 16, we used decision trees to predict a Yes/No variable.  Estimate uses a similar algorithm to predict numeric values.  The algorithm itself is a variation of linear regression.  Linear regression is arguably the simplest and most widely-used technique for estimating numeric values.  If you want to learn more about linear regression, read this.  Let's jump into it and we'll discuss more along the way.  As usual, we'll be using the Data Mining Sample data set from Microsoft.
Select Data Source
The first step is always to select a data source.  As we've previously discussed, you can also connect to a SQL Database to query your data source directly if you choose.  Let's move on.
Column Selection
Here's where things get interesting.  First, we must choose which variable we want to predict.  Then, the algorithm will automatically select the columns it thinks will create the best predictions.  It does this using the same technique we used in Part 6.  There are a number of reasons why you would not include a certain variable in your model.  For instance, if you knew that a variable was extremely unrealiable, you wouldn't want to use it.  In other cases, there are political reasons you can't use a variable, such as Race and Gender.  However, we don't see any reason you wouldn't include all of the rest of the variables regardless of what the algorithm recommends.  If the variable is a bad predictor, the algorithm will throw it away anyway.  You might as well give it a chance to really test it.  Alas, we digress.

In a combination decision tree/regression model like this, there are two types of predictors, Regressors and Categories.  As a simple example, let's try to predict Income using Age and Region.  The model itself looks like

Income = C1 * V1 + C2 * V2 + ...

where the C's are fixed values determined by the model and the V's are variables that come from the data.  When you set a value as a Regressor, Age in our example, then Age is thrown directly in as a variable, like so

Income = C1 * Age + C2 * V2 + ...

However, Region isn't a numeric variable at all, how would you put it in this formula at all?  There is a technique called Dummy Variables.  You can read about it here.  Basically, you create a variable for each value that the variable can take.  For instance, Region can be "North America", "Europe", and "Pacific".  So, you would create a 1/0 variable for each value of Region.  Mathematically, you can actually leave one of the variables out entirely and it won't matter, but we won't go that far.  Read the article if you're truly interested.  So, our model would become

Income = C1 * Age + C2 * [Region = North America] + C3 * [Region = Europe] + C4 * [Region = Pacific]

In summary, any value that you want treated as a number should be a regressor and any value you want treated as a category should just be an input.  However, this algorithm in particular looks at things in a slightly different way.  The Categories are used to create the nodes in the tree and the Regressors are used in the regression equation.  You'll see what we mean in a minute.  Let's look at our columns again.
Column Selection
As you can see, we set all of the numeric variables as regressors and the rest of the variables as inputs.  As usual, we don't include the ID or the value we're trying to predict.  Let's check out the parameters.
Remember how we said this is the same algorithm we used for Decision Trees?  Well that means that is also has the same parameters.  They can be very useful if you're trying to tweak the model to fit your needs.  We won't mess with it for now.  Let's keep moving.
Training Set Selection
As with all of the Data Mining Tools, we need to specify the size of the Training set.  We'll keep the default 70%.
Model Creation
Finally, we have to create a structure and a model to store our work.  These are housed directly on the Analysis Services server and can be recalled later if you wish.  Let's check out the results.
We get a pretty big tree on our first try.  If you're not familiar with the tree diagram, you can check out Part 15.  There are a couple of differences here.  Let's highlight them.

Just like the previous Decision Trees, each node can branch into further nodes.  However, notice the bars that we've highlighted in red?  Each node represents a chunk of our data set.  This diamond represents the distribution of the Incomes within the node.  The further to the left or right the diamond is tells us where the average Income falls within this chunk.  We see that all of these nodes have low incomes because they are far to the left.  The wider the diamond is, the more spread out the incomes are, meaning that our predictions are likely to be less accurate.  Despite being low incomes, these diamonds are all very small, which means we can be confident in the predictions.  Let's look at another piece of diagram.
Remember the C's and V's we saw earlier in our formula.  The C's are the coefficients and are represented by the values in brown.  The V's are the variables and are represented by the values in red.  Combining these two allows us to create the formula that you can see at the bottom of the screen.  We won't go into explaining this formula more than we already have.  It's interesting to note that each node gets its own formula.  Just remember this, the categories create the nodes and the regressors create the formula.  Let's move on.
Dependency Network
This algorithm has Dependency Networks just like the last one.  If you're unfamiliar with Dependency Networks, check out Part 16.

By now, you should be seeing some use for these algorithms.  Even if you're not, there's still much more to come.  You might be asking "A pretty picture is great; but, how do I get the predictions into the data?"  Unfortunately, that's a more involved task for another post.  We'll restate it again for everyone to hear, "If you want the simple things done easily, check out the Table Analysis tools.  But, if you want to do the cool stuff and really push Data Mining to its limits, you need to learn to use the Data Mining tools."  Keep an eye out for our next post, where we'll be discussing Clustering.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

Monday, August 18, 2014

Data Mining in Excel Part 16: More Classification

Today, we're going to continue our demonstration of the Classify algorithm.
In our previous post, we showed how to use this algorithm to build a basic classification tree.  However, the tree only branches once, which is not very useful.  Let's recreate the model changing one of the parameters so that we can get a bigger tree for demonstration purposes.
Complexity of .25
The Complexity Parameter affects how large the tree can grow.  By lowering the value to .25, we allow the tree to grow more than before.  Let's check out the new results.
Browse (.25 Complexity)
Right away, we can see that this tree is more complex than before.  We also see another attribute being displayed now.  Notice how some of the nodes are darker shades of blue than the others?  The darkness of the node indicates the number of rows it holds.  We can see that Node 3 (Age >= 39 and < 67) is much darker than the rest of the nodes.  See the + at the edge of the middle nodes in the last column?  These denote that there are more nodes beyond.  To display these nodes, we can either click the + or move the slider at the top of the window.
Expanding the Tree
Now, we have a few variables at play in this tree.  Let's click on the Dependency Network tab and see what's over there.
Dependency Network
This graph shows us all of the variables in our tree and whether they are used as predictors, responses, or both.  Since our model only has one response, it's a pretty clean network.  But these can get much more complex as the models grow.  Let's see what happens when we click on the response, Purchased Bike.
Dependency Network (Purchased Bike)
See that all of the predictors turn red, just like the legend at the bottom says.  Now, what happens if we select a predictor?
Dependency Network (Children)
We see that Purchased Bike turns blue, just like the legend says.  Unfortunately, our model doesn't allow for a variable to both a predictor and a response, so we don't get to see purple.  Notice the slider on the far left of the window?  What happens if we slide that down?
Dependency Network (3 Links)
Every notch we slide it down, another variable drops off based on how well it predicts Purchased Bike.  If we slide it halfway down, we are left with Cars, Age, and Commute Distance, which are the same three variables we saw in the first few levels of the classification tree.  This gives you a good idea of which variables are important for your predictions and which aren't.

Let's say that you accidentally closed the browser and didn't get to look at your model.  You can revisit the browser for the model any time you wish by using the Browse tool.
This gives you access to any model stored in your Analysis Services Instance.
Select Model
Right now, all we have is two models.  We'll have quite a few more by the time we finish this blog series.  Feel free to explore the multitude of ways you can use Decision Trees with your data.  All it takes is a binary variable and the willingness to put in the effort.  Stay tuned for our next post where we'll be talking about Estimation.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

Monday, August 11, 2014

Data Mining in Excel Part 15: Classification

Today, we will be discussing the first of the Data Mining Tools, Classify.
In our previous posts, we focused on the Table Analysis Tools.  Those tools are very narrow, targetted ways to handle a single issue.  They offer very little in the area of customization and can only be used on Excel tables, but are also very simple to use.  The Data Mining Tools are one step more advanced.  These tools offer some customization with the added complexity that comes with it.  They also have the added functionality of being able to query your SQL Database directly using SQL.  This is immensely useful if you are dealing with a data set that exceeds Excel's million row limit.

The Classify tool focuses on an area of Data Mining known as "Decision Trees".  Basically, it's a series of cascading Yes/No questions that branch out like a tree.  If you want a more complete explanation, you can check out the wikipedia article.  Let's jump in.  As usual, we will be using the Data Mining Sample data set from Microsoft.
Select Data Source
The first step to any analysis is to select a data source.  We have our table in Excel.  However, as we mentioned earlier, you can also use the "External Data Source" option to directly query a SQL Database.  Let's move on.
Column Selection
Next, we need to select which column we would like to analyze and which columns we would like to use for that analysis.  Let's try to predict whether the customer will buy a bike using all of variables, excluding ID and Purchased Bike.  Now, let's check out the "Parameters" section.
This is where the customization comes in.  This algorithm has seven parameters that can be customized to narrow the analysis.  We can see that most parameters have ranges that tell us the values that it expects.  For instance, you couldn't put a value of 5 for the Complexity Penalty.  Some of the values also have defaults.  For specifics on how each of these parameters functions, look here.  We won't mess with these just yet.
Training Set Selection
In every data mining algorithm, there are two sets of data that we use, Training and Testing.  The Training Set is what the algorithm uses to develop the underlying rules and predictions.  Then, it uses the Testing Set to see if the rules and predictions are accurate.  We'll talk more about this later.  Microsoft has a default of 70% Training, 30% Testing which is good enough for us.  Let's move on.
Model Creation
Finally, we get to a window we have not seen before.  The Data Mining tools are not just ad-hoc analyses done in Microsoft Excel.  They are actually legitimate data mining models built in Analysis Services.  In Analysis Services, each model is stored within a structure.  A structure is a logical grouping of models used for organizational purposes.  When you go through the process of using a Data Mining tool, such as Classify, a structure and model are created in your Analysis Services instance.  This is especially cool because it allows you reuse these models at later dates.  We'll get to that much later in the series.

We have three options here, Browse Model, Enable Drillthrough, and Use Temporary Model.  Browse Model allows us to see our results graphically in a similar fashion to the way we did for the Table Analysis Tools.  Enable Drillthrough allows us to look at the underlying data that was used for a certain conclusion.  Use Temporary Model would make our analysis temporary so that it doesn't appear in the Analysis Services instance.  We typically enable the first two options.  Let's see the results.
Browse (All)
These results are not quite as flashy as the ones we were seeing from the Table Analysis tools, but they are much more useful as well.  On the left side, we can see our tree (which currently only has one split).  On the right side, we can see the composition of the selected node.  Since we haven't selected anything, we see the composition of the population.  If we select one of the nodes, we get a different story.
Browse (Subset)
If we look in the bottom-right of the window, we see the qualifications for the node we select, Age < 32 or >= 39.  In other words, not in the 30s.  We can see that this group has a high percentage of people who don't buy bikes.  As a bit of a shortcut, we can look at the bar on the bottom of each node.
The legend on the right side tells us that Blue denotes No and Red denotes Yes.  So, we can look at the shortcut bars and see that the "Non-30s" node is mostly No and the "30s" node is mostly Yes.  Since we selected the "Allow Drillthrough" option, we can right-click on a node and drill-through to the underlying data.
Drill-through options
We can drill-through to either the Model Columns or the Structure Columns.  In our example, they are the same.  However, if you were to create multiple models under the same structure, you could restrict some models to only use some columns.  If we click one of the options, we see the underlying data.
Underlying Data
This would be great if we wanted to do some further analysis on a particular node.  Let's stop our analysis here for now.  As you can see, there's quite a bit of customization that can be done here.  Notice how our tree only had one branch?  Stay tuned for our next post where we tackle that issue and show some more analysis that can be done using Classification Trees.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

Monday, August 4, 2014

Data Mining in Excel Part 14: Shopping Basket Analysis

Today, we'll be talking about the final Table Analysis Tool and one of most commonly misunderstood topics in all of data mining, Shopping Basket Analysis.
Shopping Basket Analysis
This is a topic we've discussed with multiple clients and analytics vendors.  Every client wants it and every vendor says they can do it.  However, most of the time, the vendors will give you charts showing average number of items in a basket, average value of a basket, most popular items, etc.  But, this is not what the term actually means.

Shopping Basket Analysis, aka Market Basket Analysis, is an algorithm that tells you which items tend to be bought together.  For instance, do diapers and beer tend to be bought together?  This is a funny, old story with questionable origins that has been shown to not be true.  The interesting thing about this whole area is that Analysis Services has a built-in algorithm to do this for you, using data the almost every retailer will have.  Let's look at the data.
This table is just a list of the items sold and their order numbers.  In fact, we've worked with quite a few different types of retailers and servicers that have data like this already.  Now, let's see the algorithm in action.
Column Selection
First, we need to tell the algorithm what the ID is for each transaction.  This is usually known as the Order Number or Order ID.  Next, we need to tell the algorithm what the name of the item is.  In most cases, you don't want to use the actual product name.  We are interested in whether cereal and milk are typically bought together.  We aren't interested in whether Kellogg's Raisin Bran is purchased with Pet 2% milk.  Those answers are way too specific to be useful and are almost guaranteed to hurt your analysis.  So, we are using the category instead.  Optionally, you can select a value so that the algorithm can determine how much each grouping is worth.  Finally, there are a few advanced options we can look at.
Add caption
Here, we can define the Minimum Support and Minimum Probability.  The support is total number of times that the grouping needs to appear in order to be valid.  For instance, if only one customer ever purchased batteries, and that customer also purchased milk, you don't want to say "100% of customers who buy batteries also buy milk."  So, we get to set a minimum support in order to ensure accuracy.  This value depends entirely on the size and distribution of your data set.  For our analysis, we'll use 10 items.

The probability is the number of times the rule is correct within the given support.  For instance, if 100 customers buy cereal, and 10 of those customers also buy ice cream, we may not want to make decisions based off of a 10% probability.  The algorithm has a default probability of 40% that works fine for us.  You can tune it for your needs if you wish.  Let's check out the results.
We see that Road Bikes and Helmets are bought together quite a bit.  We can also see Jersey and Road Bikes, Bottles and Cages, and many other combinations.  The one downside to this algorithm is that it doesn't exclude the obvious choices.  For instance, most business analysts will know that some items are bought together.  For instance, why would you buy a bottle without a cage to put it in?  So, this algorithm is only a starting point for your analysis.  However, it does offer you statistically valid results that you can use to make real decisions.  As much as we'd love to use Tableau to liven up the algorithm a little.  The results don't lend themselves to easy integration with the original data set.  Therefore, this is as far we go for today.

The great thing about this algorithm is that it doesn't just apply to retailers.  Banks could use this algorithm to see which packages customers tend to buy together.  Network Administrators could use this algorithm to see which issues tend to occur together.  The possibilities are limited only by your imagination.  Keep an eye out for the next post on Classification.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit