Tuesday, May 27, 2014

Data Mining in Excel Part 4: Re-Labelling your Data

Today, we will talk about the second portion of the "Clean Data" feature of the Data Mining Add-in for Excel, "Re-Label".
Clean Data
While the Outlier feature was useful for removing extreme numeric observations, the Re-Label feature allows to you change string values.  As usual, we will be using the Data Mining Sample Data from Microsoft.  Let's skip the foreplay and jump right into it.
Select Source Data
Of course, the first step is to select our data set.
Select Column
Then, we select the column where we would like to Re-label some values.  For this example, let's choose Education.
Re-Label Education
Now, we can proceed to Re-label these columns in any way we want.  For instance, what if we aren't interested in Partial College education?  If they didn't get the degree, then their education is High School.  That's an easy change here.
Partial College -> High School
We aren't forced to choose from existing options either.  What if we want more consistent labels?  We can easily alter each label to look a little more professional.
Professional Titles
This isn't all we can do either!  What if we were Insurance Adjusters and we needed to rate these people.  We could assign points to them based on their level of education.  More education means more points, leading to a lower insurance premium.
As you can see, this is all really easy to do.  In fact, there was an interesting anecdote in the Data Mining book we mentioned in the first post of this series.  A man was presenting the Data Mining tool in a foreign country.  At the start of his presentation, he noticed that the entire data set was in English.  In a matter of minutes, he was able to translate the entire data set into their native language just by using this feature.  We have no idea if this actually happened or not.  But, after using this feature, we don't have any doubts that it could be done.  Now, what do we do with these labels?  We have a couple of options.
Select Destination
We can append the new column to our data, copy the changes into a new worksheet, or just replace the data where it is.  We recommend appending columns because it's typically not a good idea to outright change your data.
Data with Points
Now that we've re-labelled our data, it's time to move on with our analysis.  Unfortunately, you're just going to have to wait for the next post.  In the next entry, we'll be talking about the Sample data feature which allows you to easily chop up giant data sets into smaller ones for improved performance without losing any statistical significance.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Analytics Consultant
Mariner, LLC

No comments:

Post a Comment