Monday, June 30, 2014

Data Mining in Excel Part 9: Forecasting

Today, we will talk about one of the simplest, yet most useful, Table Analysis Tools, Forecast.
Forecasting is a very common practice that many tools can handle.  Previously on this blog, we've spoken about Forecasting in Tableau and R.  Now, let's see how the Table Analysis Tools handle it.

In general, the Table Analysis Tools are narrower versions of the algorithms in Analysis Services.  The bad thing about this is that there is no tweaking available within the Table Analysis Tools Forecasting algorithm.  If you want to play with the results, you will either need to use the Data Mining Forecasting algorithm in Excel, which we cover in a later post, or Analysis Services, which we may or may not cover.  Now, let's take a look at the data.
 We have sales for three different regions, Europe, North America, and Pacific.  We also have a date stamp of some type.  When dealing with Year/Month values, we typically prefer to use Year + Month / 12 instead of YearMonth.  However, the algorithm doesn't care very much.  In fact, the time stamp seems to only be there for the user to see.  The algorithm doesn't seem to use it at all.  Let's get started.
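Before we do, a quick aside on the time stamp format.  The Year + Month / 12 convention keeps consecutive months evenly spaced, which a concatenated YearMonth value does not.  Here's a one-line sketch in Python (the function name and column layout are our own, just for illustration):

```python
def to_time_index(year, month):
    """Year + Month / 12 turns a Year/Month pair into a continuous
    time index where consecutive months are exactly 1/12 apart.
    A concatenated YearMonth value (e.g. 201407) lacks this
    property: 201312 -> 201401 jumps by 89, not 1."""
    return year + month / 12

# December 2013 and January 2014 are exactly one month apart
gap = to_time_index(2014, 1) - to_time_index(2013, 12)
print(round(gap, 4))  # 0.0833
```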
Column Selection
As usual, the algorithm lets us choose which columns we want to use.  We can also tell the algorithm how far into the future we want to forecast.  Be careful how far you forecast into the future: the further ahead you look, the less accurate the predictions become.  Also, the algorithm allows you to choose the periodicity of the data.  For instance, if your data is monthly, then you might have a yearly periodicity.  This means that the data from this July is similar to the data from last July.  The algorithm is pretty good at picking up on periodicity, so it's recommended that you use "Detect Automatically".  Finally, you can tell the algorithm what to use as a time stamp.  The algorithm doesn't actually use the time stamp, but it does make the resulting graph look much more intuitive.  Now, let's see the output.
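Before looking at the output, one more aside.  We can't reproduce Microsoft's time series algorithm here, but a minimal "seasonal naive" forecast illustrates what periodicity buys a forecaster: each future month is predicted from the same month one period earlier.  The sales figures below are made up:

```python
def seasonal_naive_forecast(series, period, steps):
    """Forecast each future point as the value one full period
    earlier.  This is the simplest model that exploits periodicity:
    with monthly data and yearly seasonality (period=12), next July
    is predicted to equal last July."""
    history = list(series)
    forecasts = []
    for _ in range(steps):
        forecasts.append(history[-period])
        history.append(forecasts[-1])
    return forecasts

# Two years of monthly sales with a clear yearly cycle
sales = [10, 12, 15, 20, 25, 30, 35, 33, 28, 22, 16, 11,
         11, 13, 16, 21, 26, 31, 36, 34, 29, 23, 17, 12]
print(seasonal_naive_forecast(sales, period=12, steps=3))  # [11, 13, 16]
```

The real algorithm is far more sophisticated than this, which is exactly why forecasts degrade the further out you look: even a good model compounds its errors step by step.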
Forecast Chart
Voila!  We now have forecasts for the next five months.  As expected, the forecasted values have also been appended to the bottom of the original table.
Forecast Table
This is the point where we'd usually mock up a simple dashboard showing you the amazing visuals you can create with the data.  However, this data only contains Sales, Region, and MonthYear, which aren't the right types of variables to create any interesting visuals.  That said, there's nothing stopping you from combining the algorithms on your own data.  For instance, you could use "Detect Categories" on your customers, then forecast the next six months of sales for each category.  You could then use these forecasts to see which categories are likely to be profitable.  Stay tuned for the next post where we'll be talking about Highlight Exceptions.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

Monday, June 23, 2014

Data Mining in Excel Part 8: Fill From Example

Today, we're going to talk about the next in the line of Table Analysis tools, Fill From Example.
Fill From Example
The Fill From Example tool allows you to "impute" values into your data set.  We've spoken briefly on imputation in Part 3 of this series.  In layman's terms, it means filling in unknown or blank values with another value of your choosing.  Imagine you are selling car stereos and would like to know what type of vehicle your customers drive.  But, you can't ask every customer you've ever had.  So, you ask a subset of the customers and apply some logic to make an educated guess as to what type of vehicles the rest of your customers drive.  This algorithm is designed to do exactly that.
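The real tool builds a predictive model from the labeled rows.  As a much simpler stand-in for the idea, here's a sketch that fills each blank with the majority label among the labeled rows that match it on the most columns.  All column names and values below are made up:

```python
from collections import Counter

def fill_from_example(rows, target):
    """Fill each missing `target` value with the majority label among
    the labeled rows that match the blank row on the most columns.
    A toy stand-in: the real tool builds a proper classification
    model from the labeled rows."""
    labeled = [r for r in rows if r.get(target) is not None]
    filled = []
    for row in rows:
        if row.get(target) is not None:
            filled.append(dict(row))  # already labeled: keep as-is
            continue
        # Score each labeled row by how many non-target columns match
        def matches(other):
            return sum(1 for k, v in row.items()
                       if k != target and other.get(k) == v)
        best = max(matches(r) for r in labeled)
        votes = Counter(r[target] for r in labeled if matches(r) == best)
        filled.append({**row, target: votes.most_common(1)[0][0]})
    return filled

customers = [
    {"Occupation": "Manual", "Region": "Europe", "HighValue": "No"},
    {"Occupation": "Manual", "Region": "Europe", "HighValue": "No"},
    {"Occupation": "Professional", "Region": "Pacific", "HighValue": "Yes"},
    {"Occupation": "Manual", "Region": "Europe", "HighValue": None},
]
print(fill_from_example(customers, "HighValue")[-1]["HighValue"])  # No
```

The last customer looks like the first two, so the sketch guesses "No", which is the same kind of educated guess the car-stereo seller makes above.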

In this data set, we have the same customers we've been looking at all along.  However, we also have an extra column now called "High Value Customer".  This column indicates whether a customer is worth a lot of money.  However, we only have values for a small number of customers.
So, let's use the Fill From Example tool to make an educated guess about the rest of the values.
Column Selection
As usual, the first step requires us to choose which columns we want to use.  We want to use the dropdown box to select the column we would like to impute values into.  We can also click on the blue link to tell the algorithm that only certain columns should be considered in this analysis.  Once we're done, another sheet gets added to our workbook.
Discrimination Table
Here, we see a familiar table.  This is a "Discrimination Table".  We can use it to determine which values influence other values in our data set.  For instance, we see that customers who are from the Pacific, commute 5 - 10 miles, or are female tend not to be very valuable customers.  On the other hand, customers who are Professionals, commute 2 - 5 miles, or have 5 children are likely to be valuable.  We want to reiterate (and not for the last time) that these algorithms measure CORRELATION, NOT CAUSATION.  This is arguably the most important point to emphasize when you are talking about Data Mining.  Now, let's go back to our data.
Customers (Imputed)
We see that the tool also appended a column to our data set with the predicted values.  We should note that the tool will not predict values that actually exist, since we already know what the right answer is.

Finally, we are at the point where we can start combining the tools we have been using and gain more insight.  Let's see what kind of visualizations we can build if we combine Detect Categories and Fill From Example.
As you can see, our visualizations just got that much better, simply by clicking a button and waiting 15 seconds.  This just goes to show how much insight you can gain with a tool this powerful.  Stay tuned for the next post in the series, Forecasting.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

Monday, June 16, 2014

Data Mining in Excel Part 7: Detect Categories

Today, we will be talking about the second component in the Table Analysis Tools ribbon, Detect Categories.
Detect Categories
This component is designed to let you group your data into segments based on how similar the observations are to each other.  Generally, the observations in a segment are more similar to each other than they are to observations in other segments.  In more technical terms, this methodology is known as "Clustering", and is possibly the most common (and useful!) practice in Data Mining.  As usual, we will be using the Data Mining Sample data set from Microsoft.

The greatest aspect of clustering is that it allows you to understand what types of values are in your data on a larger scale; and it takes very little knowledge of your data to do so.  Let's begin.
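First, a quick aside on what clustering does under the hood.  Detect Categories uses Microsoft's own clustering algorithm, which also handles categorical columns; we can't reproduce it here, but the core idea of grouping similar observations can be sketched with plain k-means on numeric data.  Everything below is a toy illustration, not the add-in's actual method:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members.  Deterministic
    here because we seed the centroids with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid where it was
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return [min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

# Two obvious groups of 2-D points
data = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
labels = kmeans(data, k=2)
print(labels)  # first three points share one label, last three the other
```

Notice that the sketch needs to be told k; the add-in's "Auto-detect" option spares you that decision, which is exactly why we recommend it below.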
Column Selection
As you can see, it automatically identifies that ID should not be in the algorithm and removes it.  Since we don't have any previous knowledge of this data, we want to give it everything.  If you are looking for specific types of groupings, or don't care about certain fields, then you can remove them here.  We also want to mention that specifying a precise number of categories takes a good bit of power away from the algorithm.  If you want to find out what's in your data, why would you assume that you already know how many types of data you have?  Therefore, it's generally a good idea to use the "Auto-detect" feature.  Also, take note of the check box at the bottom of the window.  We'll come back to that later.  Let's move on.
Category Row Counts
Voila!  Our data has been categorized into seven categories with between 100 and 200 rows in each category.  Now, the fun part begins.
Category 1 Characteristics
If we scroll down the worksheet a little bit, we see this chart.  It shows which values are important for which cluster.  These values are listed in order of "Importance".  This score measures how uniquely a value represents the cluster: the higher the score, the more uniquely the value describes it.  So, we can use these values to give our cluster an intuitive name.  The values with low importance are, pardon the redundancy, not important.  So, let's use the first few values and call this cluster "Blue-Collar Europeans".  Note that this is definitely the art within the science.  We also think it's a lot of fun.  Even better, if we rename the cluster using the "Category Row Counts" chart, it changes the names everywhere.
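We haven't found a published formula for this Importance score, but a common way to measure how uniquely a value represents a cluster is "lift": compare the value's frequency inside the cluster to its frequency everywhere else.  Here's our own sketch of that idea, with made-up counts:

```python
import math

def importance(in_cluster, cluster_size, elsewhere, elsewhere_size):
    """Log-lift of a value inside a cluster versus outside it.
    Positive means over-represented in the cluster; the larger the
    score, the more uniquely the value describes the cluster.  This
    is our own stand-in, not the add-in's actual formula."""
    p_in = max(in_cluster / cluster_size, 1e-9)    # guard against log(0)
    p_out = max(elsewhere / elsewhere_size, 1e-9)
    return math.log(p_in / p_out)

# Suppose "Occupation = Manual" covers 80 of 150 rows in Category 1,
# but only 120 of the 850 rows elsewhere
print(round(importance(80, 150, 120, 850), 2))  # 1.33
```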
First Column Named
Blue-Collar European Characteristics
Now, let's repeat this process for Category 2.
Category 2 Characteristics
This one doesn't have quite the same ring to it, but we'll call it "Graduate Degree with Low Income and No Car".  Now, we'll repeat this process for the rest of the Categories and show you the end result.
All Categories Named
Now that we've named all of the categories, let's check our original data source.
Original Data
The add-in automatically added the category column to our original data set and keeps it updated with the names we chose.  This also allows us to use these categories in any further analysis we want.  Furthermore, if we go back to the output sheet again, we can see another chart further down.
Category Comparison
This chart allows us to compare our categories across one or more variables.  For instance, this view shows us that Category 3 is made up almost entirely of young people and Category 6 has the highest percentage of middle-aged people.  You could also use the filters at the bottom to show multiple variables at the same time.  However, we believe this to be too cluttered for any real analysis.  Unfortunately, this chart doesn't update with the names we gave our Categories.  Lastly, we see a very large table at the bottom of the sheet.
In this table, each row tells us how many rows had each value in each category, which is known as the "Support".  For instance, we see that there are 539 married people and 461 single people in the data.  We're not sure what would cause the supports to be decimals, as they should be integers.  Alas, the decimals aren't very important anyway.  If they bother you, feel free to round the values to the nearest integer.
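Counting support is simple enough to sketch in a few lines.  As for the decimals, our best guess (which we can't confirm) is that they come from probabilistic cluster assignments, where a row can belong fractionally to more than one category; counts of raw rows, as below, are always integers:

```python
from collections import Counter

def support_table(rows, column):
    """Count how many rows take each value of `column` -- the
    'Support' from the table at the bottom of the output sheet."""
    return Counter(r[column] for r in rows)

# Recreate the marital status counts from the text
rows = [{"Marital Status": "Married"}] * 539 + \
       [{"Marital Status": "Single"}] * 461
counts = support_table(rows, "Marital Status")
print(counts["Married"], counts["Single"])  # 539 461
```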

Finally, this is the first algorithm that appends a result onto the data set.  This means that we can use this value in any sort of visualizations.  Avid readers will know that this blog heavily emphasizes our favorite visual analytics tool, Tableau.  Here's a tiny bit of what you can do if you combine the two tools.
The Detect Categories, and later the Clustering algorithm, is probably the most useful component in the entire Data Mining feature set from Microsoft.  It makes it so much easier to investigate your data in entirely new ways.  There is so much exploration that can be done and so much insight that can be gained if you are willing to put just a small amount of effort into it.  Stay tuned for the next entry, Fill From Example.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Director, Consumer Sciences
Consumer Orbit

Monday, June 9, 2014

Data Mining in Excel Part 6: Analyze Key Influencers

Today, we will move over to the "Analyze" Ribbon and talk about the "Analyze Key Influencers" component.
Analyze Key Influencers
If you click on a cell within a table in Excel, you will see the Analyze and Design Ribbons at the top of the screen.  Now, let's take a moment to discuss the Analyze Ribbon.  In terms of customizability and complexity, its algorithms are the simplest, which makes it a great place to start our exploration into Data Mining.  These algorithms function within Excel and, in our opinion, have some of the best outputs.  Let's look at Analyze Key Influencers.

Analyze Key Influencers allows you to find out which columns in your data seem to heavily influence other columns.  Are Number of Children and Number of Cars correlated?  How about Income and Education?  These are very common questions that people ask, and this is the first place where we can really start to get answers.  Let's get going.
Column Selection
Since this component is only accessible from within a Table, the algorithm only uses data from within the table.  Next, we have to tell it which column we want to find the influencers of.  Everybody wants to know "What factors influence how much money people make?"  Well, here's where we can find out.  Please note that this is a sample data set and is very unlikely to unlock the golden ticket for you to earn a billion dollars.  We also want to select which columns to use for the analysis by clicking the hyperlink at the bottom of the box.
Advanced Column Selection
The ID field is a numeric value randomly assigned to each person by the database, so we almost never want to use those in our analyses.  Also, since we want to see what influences Income, we should also remove Income because it could skew our analysis.  After all, what's a better indicator of your Income than your Income?  Now, let's move on.
Detecting Patterns
When you run any of the Table Analysis Tools, you will see this box letting you know that it's working.
Discrimination Reporting
Now, we get to ask the algorithm to compare groups for us.  Since Income is numeric, it separates the values into buckets.  This process is called discretization.  We can choose to compare different categories to each other, or see what makes them unique from all other categories.  Let's start by comparing the lowest income category, which we'll call "Lower Class", to the second lowest, which we'll call "Lower-Middle Class".  When we click "Add Report", another section is added to our Excel sheet that we can look at.  Let's check it out.
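Before we do, a quick aside on discretization.  We don't know which binning method the add-in applied here (Analysis Services supports several), but equal-width bucketing is the simplest way to illustrate the idea.  The incomes and bucket count below are made up:

```python
def discretize(values, buckets):
    """Equal-width bucketing: split the numeric range into `buckets`
    intervals of identical width and label each value with its
    bucket index (0 = lowest)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant column has one bucket
    width = (hi - lo) / buckets
    # Clamp so the maximum value falls in the last bucket
    return [min(int((v - lo) / width), buckets - 1) for v in values]

incomes = [20000, 35000, 42000, 58000, 90000, 150000]
print(discretize(incomes, buckets=3))  # [0, 0, 0, 0, 1, 2]
```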
Key Influencers Report for Income
The first section we see shows us which values in other columns seem to correlate with certain values in Income.  The column labelled "Relative Impact" is a statistical ranking that shows us how strong the relationship is between the two values.  For instance, we see that the strongest predictor for having a low income is having a manual job.  We see that living in Europe and having a clerical job also influence having a low income.  Please take care to note that this is CORRELATION, not CAUSATION.  Nothing in this data set implies that having a manual job causes a low income.  It simply states that the two seem to occur together.  Let's see what distinguishes the highest income group.
Influencers for High Income
We see that the heaviest influencer for having high income is having four cars.  Obviously, we know that having four cars doesn't make you rich.  In fact, we believe that it's the opposite.  You have four cars because you have a high income.  This just goes to show that care should be taken when interpreting the results.  We also see that having a management job and having three cars are heavy influencers.

This leads us to another question, why did the algorithm discretize Income, but not Cars?
Despite the fact that Cars is numeric, it doesn't have many distinct values.  Therefore, discretizing it wouldn't make much sense.  Off the top of our heads, we're not sure how many distinct values a column needs before it gets discretized, but we think it's probably at least eight.

Now, let's hop over to our Key Influencers Report again.  Remember that analysis we asked for comparing lower class to lower-middle class?  If we scroll down a little further, we see that it's been tacked right onto the bottom.
Discrimination Report
This is called a discrimination report.  It's designed to show you which values "Favor" which categories.  The report we previously looked at tried to discriminate each category from all the other categories.  This report only tries to discriminate one category from another.  We see that the same influencers we saw before appear on this report as well.  Please note that this will not always be the case; it just happens to be so in this data.  We also see that people in North America with professional or management jobs are more likely to be in the lower-middle class.  Looking further down the list, we also see the introduction of lesser factors, such as education and commute distance.

We hope you enjoyed our first look at a true analytical procedure.  Stay tuned for the next post where we'll be talking about the "Detect Categories" component.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Analytics Consultant
Mariner, LLC

Monday, June 2, 2014

Data Mining in Excel Part 5: Sampling from your Data

Today, we're going to talk about the last component of the Data Preparation Segment in the Data Mining Add-in for Excel, Sample Data.  Some of us have LOTS of data.  If you have a table with 1,000,000 rows of data and 50 columns, you're asking a lot from Excel when you try to run some analysis on that data.  So, a common practice is to randomly choose a handful of those rows, let's say 10,000, to do your analysis on.  The keyword in that sentence is "random".  If you choose the rows yourself, you may introduce some sort of unknown bias into your sample, which could heavily influence the results of your analysis.  So, we're going to let the tool do it for us.  As usual, we will be using the Data Mining sample data set from Microsoft.
Sample Data
The first step of any analysis is to select the source of data.
Select Source Data
This component is the first time we've seen the "External Data Source" option.  This is really cool.  Imagine you work for a giant bank that has a fact table for all of the transactions for all of your customers.  This fact table could easily be billions of rows.  That's way too much for Excel or Analysis Services to handle without some very powerful hardware.  Have no fear!  The Data Mining add-in will let you pick a few thousand rows from that table and play with them right in Excel.  Better yet, it will RANDOMLY pull those rows.  Also, you can write your own query here as well.  So, if you want to RANDOMLY select transactions from this year at all ATMs in the Southeast region, you can.  However, in our example, we only have 1000 rows of data.  Let's move on.
Select Sample Type
There are two types of sampling available here, Random Sampling and Oversampling.  Random sampling is the more common type.  It works just like flipping a coin or rolling a die: no attention is paid to the values in the data; rows are simply picked at random.  On the other hand, Oversampling is used to ensure that rare values are represented.  For instance, if you want to run a test that compares millionaires to non-millionaires, most random samples wouldn't have any millionaires in them at all.  It's difficult to test for something that doesn't occur in your data.  Back to our millionaires example: if you were to oversample your data, you could guarantee that at least 10% of your sample includes millionaires.  This does cause issues for some types of analysis, so only use it when you absolutely need to.  It's a very common practice in fraud detection, where analysts are routinely looking for one transaction out of millions.  For now, let's just stick with a random sample.
Random Sampling Options
We're given two options.  We can either sample a fixed percentage of our data, or a fixed number of rows.  Let's stick with the default, 70%.
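As an aside, the 70% draw is easy to mimic with Python's standard library.  A small sketch (the fixed seed is only so the example is reproducible; a real sample should vary from run to run):

```python
import random

def sample_rows(rows, pct, seed=42):
    """Randomly select pct percent of the rows, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * pct / 100))

rows = list(range(1000))          # stand-in for 1,000 data rows
train = sample_rows(rows, 70)
kept = set(train)
holdout = [r for r in rows if r not in kept]
print(len(train), len(holdout))   # 700 300
```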
Random Sampling Finish
Finally, we get to name the sheet where the results will be stored.  Oddly enough, you can also choose to store the unselected data.  We can't think of any reason why you would want this data.  We will say that if you want to run repeated tests on samples, then you need to sample each time.  You can't create two random samples simply by using the selected and unselected data from your first sample.  Now, let's go back and see what happens when we try Oversampling.
Oversampling Options
Here, we can decide which value we want to see more of.  We can also decide what percentage of our sample will have that value.  In fact, that percentage is guaranteed.  Consider this example.  You have a data set with 13 rows, 10 of which have yes and 3 of which have no.  If you oversample the data with a 30% chance of no and a sample of size 6, the algorithm will randomly select 2 of the 3 no values, and 4 of the 10 yes values.  However, if you increase the sample size to 13 (the entire data set), the algorithm will realize that there are only 3 no values, and restrict the sample size to 10.  This means that you will always have at least your requested percentage of a particular value.
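The 13-row example above can be expressed directly in code.  This is our own sketch of the behavior described, not the add-in's implementation:

```python
import math
import random

def oversample(rows, target, rare_value, pct, size, seed=1):
    """Draw a sample in which at least pct percent of the rows have
    target == rare_value.  If there aren't enough rare rows to hit
    the percentage, take them all and shrink the sample instead."""
    rng = random.Random(seed)
    rare = [r for r in rows if r[target] == rare_value]
    common = [r for r in rows if r[target] != rare_value]
    need = math.ceil(size * pct / 100)
    if need > len(rare):
        need = len(rare)
        size = math.floor(need * 100 / pct)  # shrink, as in the 13-row case
    picked = rng.sample(rare, need) + rng.sample(common, size - need)
    rng.shuffle(picked)
    return picked

# 13 rows: 10 yes, 3 no -- the example from the text
rows = [{"answer": "yes"}] * 10 + [{"answer": "no"}] * 3
sample = oversample(rows, "answer", "no", pct=30, size=6)
print(len(sample), sum(r["answer"] == "no" for r in sample))  # 6 2
```

Asking for all 13 rows with the same 30% requirement makes the sketch fall back to 3 no values and a total of 10 rows, matching the behavior described above.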

These techniques are well respected in the scientific community and can be really useful when you want to explore and analyze really large data sets.  Stay tuned for the next post where we'll be talking about the "Analyze Key Influencers" component.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Analytics Consultant
Mariner, LLC