Today, we will talk about the second portion of the "Clean Data" feature of the Data Mining Add-in for Excel, "Re-Label".
|
Clean Data |
While the Outlier feature was useful for removing extreme numeric observations, the Re-Label feature allows to you change string values. As usual, we will be using the Data Mining Sample Data from Microsoft. Let's skip the foreplay and jump right into it.
|
Select Source Data |
Of course, the first step is to select our data set.
|
Select Column |
Then, we select the column where we would like to Re-label some values. For this example, let's choose Education.
|
Re-Label Education |
Now, we can proceed to Re-label these columns in any way we want. For instance, what if we aren't interested in Partial College education? If they didn't get the degree, then their education is High School. That's an easy change here.
|
Partial College -> High School |
We aren't forced to choose from existing options either. What if we want more consistent labels? We can easily alter each label to look a little more professional.
|
Professional Titles |
This isn't all we can do either! What if we were Insurance Adjusters and we needed to rate these people. We could assign points to them based on their level of education. More education means more points, leading to a lower insurance premium.
|
Points |
As you can see, this is all really easy to do. In fact, there was an interesting anecdote in the Data Mining book we mentioned in the first post of this series. A man was presenting the Data Mining tool in a foreign country. At the start of his presentation, he noticed that the entire data set was in English. In a matter of minutes, he was able to translate the entire data set into their native language just by using this feature. We have no idea if this actually happened or not. But, after using this feature, we don't have any doubts that it could be done. Now, what do we do with these labels? We have a couple of options.
|
Select Destination |
We can append the new column to our data, copy the changes into a new worksheet, or just replace the data where it is. We recommend appending columns because it's typically not a good idea to outright change your data.
|
Data with Points |
Now that we've re-labelled our data, it's time to move on with our analysis. Unfortunately, you're just going to have to wait for the next post. In the next entry, we'll be talking about the Sample data feature which allows you to easily chop up giant data sets into smaller ones for improved performance without losing any statistical significance. Thanks for reading. We hope you found this informative.
Brad Llewellyn
Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn
No comments:
Post a Comment