Monday, November 27, 2017

Azure Machine Learning in Practice: Feature Engineering

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation, Data Cleansing, Model Selection, Model Evaluation, Threshold Selection and Feature Selection phases of the experiment.  In this post, we're going to walk through the feature engineering process.

Feature Engineering is the process of adding new features or transforming existing features in the input dataset.  The goal of Feature Engineering is to create features that will greatly strengthen the model in terms of performance or accuracy.  This is a huge area within Machine Learning.  We can't possibly do it justice in just one post.  However, we can talk about a few different ways we would approach this problem using our dataset.

Just like with Feature Selection, a traditional machine learning workflow would usually place this among the first steps of the process.  However, Azure Machine Learning makes it so easy to throw the raw data at a large pool of models that we oftentimes find we can create good models without much feature transformation.  This leaves us in a position to decide whether the additional performance or accuracy is worth the time required to engineer more features.

In this case, let's assume that 93% precision and 90% recall are not high enough.  If so, we may be able to gain some additional accuracy from feature engineering.  Let's take a look at the dataset again for a refresher.
Credit Card Fraud Data 1

Credit Card Fraud Data 2
The following description is lifted from the first post in this series.

We can see that this data set has the following columns: "Row Number", "Time", "V1"-"V28", "Amount" and "Class".  The "Row Number" column is simply used as a row identifier and should not be included in any of the models or analysis.  The "Time" column represents the number of seconds between the current transaction and the first transaction in the dataset.  This information could be very useful because transactions that occur very rapidly or at constant increments could be an indicator of fraud.  The "Amount" column is the value of the transaction.  The "Class" column is our fraud indicator.  If a transaction was fraudulent, this column has a value of 1; otherwise, it has a value of 0.

Finally, let's talk about the "V1"-"V28" columns.  These columns represent all of the other data we have about these customers and transactions combined into 28 numeric columns.  Obviously, there were far more than 28 original columns.  However, in order to reduce the number of columns, the creator of the data set used a technique known as Principal Component Analysis (PCA).  This is a well-known mathematical technique for creating a small number of very dense columns from a large number of sparse columns.  Fortunately for the creators of this data set, it also has the advantage of anonymizing any data it's applied to.  While we won't dig into PCA in this post, there is an Azure Machine Learning module called Principal Component Analysis that will perform this technique for you.  We may cover this module in a later post.  Until then, you can read more about it here.
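While we won't walk through the math here, the core idea of PCA can be sketched in a few lines of Python.  This is a hedged illustration using NumPy on hypothetical random data, not the Azure Machine Learning module itself:

```python
import numpy as np

# Hedged sketch of what PCA does conceptually: project many correlated
# columns onto a few dense "component" columns via SVD.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 10))       # hypothetical: 200 rows, 10 columns

centered = raw - raw.mean(axis=0)      # PCA operates on centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep only the top 3 principal components: 10 columns become 3.
components = centered @ Vt[:3].T
```

The resulting `components` matrix plays the same role as the "V" columns in our dataset: a small number of dense, anonymized numeric features standing in for many original ones.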

Now, let's talk about three different types of engineered features.  The first type is the "Discretized" feature.  Discretization, also known as Bucketing or Binning, is the process of taking a continuous feature (generally a numeric value) and turning it into a categorical value by applying thresholds.  Let's take a look at the "Amount" feature in our data set.
Amount Statistics
Amount Histogram
Obviously, we aren't going to get any information out of this histogram without some serious zooming.  We did spend some time on this, but there's really not much to see there.  Therefore, we can just use the information from the Statistics.

We see that the mean is four times larger than the median.  This indicates that this feature is heavily right-skewed, which is exactly what we're seeing in the histogram: most of the records belong to a proportionally small set of values.  This is an extremely common pattern when looking at dollar amounts and other monetary values.  So, how do we choose the thresholds we want to use?  First and foremost, we should use domain knowledge.  There is no way to replace a good set of domain knowledge.  We firmly believe that the best feature engineers are the ones who know a data set and a business problem very well.  Unfortunately, we're not Credit Card Fraud experts.  However, we do have some other tools in our toolbox.
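As a quick illustration of the mean-versus-median check described above, here's a minimal Python sketch on a hypothetical sample of amounts:

```python
from statistics import mean, median

# Hypothetical sample of transaction amounts.  A mean sitting far above the
# median is a simple signal that the distribution is right-skewed.
amounts = [1.50, 3.20, 5.00, 8.75, 12.00, 22.00, 95.00, 600.00]

is_right_skewed = mean(amounts) > 2 * median(amounts)
```

In our dataset, the mean Amount is roughly four times the median, so this check fires easily.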

One technique for discretizing a heavily skewed feature is to create buckets by using powers.  For instance, we can create buckets for "<$1", "$1-$10", "$10-$100" and ">$100".  In general, we like to handle these using the "Apply SQL Transformation" module.  You could easily do the same using R or Python.  Here's the code we used and the resulting column.

SELECT
    *
    ,CASE
        WHEN [Amount] < 1 THEN '1 - Less Than $1'
        WHEN [Amount] < 10 THEN '2 - Between $1 and $10'
        WHEN [Amount] < 100 THEN '3 - Between $10 and $100'
        ELSE '4 - Greater Than $100'
    END AS [Amount (10s)]
FROM t1
Amount (10s) Histogram
Using this technique, we were able to take an extremely skewed numeric feature and turn it into an interpretable discretized feature.  Did this help our model?  We won't know that until we build a new model.  In general, the goal is to create as many new features as possible, then see which ones are truly valuable at the end.  In fact, there are other techniques out there for choosing optimal thresholds that can provide tremendous business value.  We encourage you to investigate this.  In this experiment, we'll create new features for Amount (2s) and Amount (5s), then move on.

SELECT
    *
    ,CASE
        WHEN [Amount] < 1 THEN '1 - Less Than $1'
        WHEN [Amount] < 2 THEN '2 - Between $1 and $2'
        WHEN [Amount] < 4 THEN '3 - Between $2 and $4'
        WHEN [Amount] < 8 THEN '4 - Between $4 and $8'
        WHEN [Amount] < 16 THEN '5 - Between $8 and $16'
        WHEN [Amount] < 32 THEN '6 - Between $16 and $32'
        WHEN [Amount] < 64 THEN '7 - Between $32 and $64'
        WHEN [Amount] < 128 THEN '8 - Between $64 and $128'
        ELSE '9 - Greater Than $128'
    END AS [Amount (2s)]
    ,CASE
        WHEN [Amount] < 1 THEN '1 - Less Than $1'
        WHEN [Amount] < 5 THEN '2 - Between $1 and $5'
        WHEN [Amount] < 25 THEN '3 - Between $5 and $25'
        WHEN [Amount] < 125 THEN '4 - Between $25 and $125'
        ELSE '5 - Greater Than $125'
    END AS [Amount (5s)]
    ,CASE
        WHEN [Amount] < 1 THEN '1 - Less Than $1'
        WHEN [Amount] < 10 THEN '2 - Between $1 and $10'
        WHEN [Amount] < 100 THEN '3 - Between $10 and $100'
        ELSE '4 - Greater Than $100'
    END AS [Amount (10s)]
FROM t1
Amount (2s) Histogram
Amount (5s) Histogram
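For readers who prefer Python over SQL, the same power-based bucketing can be sketched as a plain function.  This is a hedged equivalent of the Amount (10s) logic above, applied to a few hypothetical amounts:

```python
def bucket_amount_10s(amount):
    """Discretize a dollar amount into power-of-10 buckets."""
    if amount < 1:
        return "1 - Less Than $1"
    elif amount < 10:
        return "2 - Between $1 and $10"
    elif amount < 100:
        return "3 - Between $10 and $100"
    else:
        return "4 - Greater Than $100"

# Hypothetical transaction amounts and their buckets.
amounts = [0.50, 7.25, 42.00, 149.62]
buckets = [bucket_amount_10s(a) for a in amounts]
```

The Amount (2s) and Amount (5s) variants follow the same pattern, just with thresholds at powers of 2 and 5 instead of 10.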
The next type of engineered feature is the "Transformation" feature.  Transformation, at least in our minds, is the process of taking an existing feature and applying some type of basic function to it.  In the previous example, we tried to alleviate the skew of the "Amount" feature by discretizing it.  However, there are other options.  For instance, what if we were to create a new feature, "Amount (Log)", by taking the logarithm of the existing feature?

SELECT
    *
    ,LOG( [Amount] + .01 ) AS [Amount (Log)]
FROM t1
Amount (Log) Histogram
We can see that this feature now strongly resembles a bell curve.  Honestly, this makes us question whether this data set is actually real or if it's faked.  In the real world, we don't often find features that look this clean.  Alas, that's not the purpose of this post.

There are plenty of other functions we can use for Transformation features.  For instance, if a feature has a large number of negative values, and we don't care about the sign, it could help to apply a square or absolute value function.  If a feature has a "wavy" distribution, it could help to apply a sine or cosine function.  There are a lot of options here that we won't explore in depth in this post.  For now, we'll just keep the Amount (Log) feature and move on.
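The log transformation above translates directly into Python.  This is a hedged sketch using the natural logarithm with the same 0.01 offset to keep zero-dollar transactions defined:

```python
import math

def log_amount(amount, offset=0.01):
    """Log-transform an amount, offsetting so that zero stays defined."""
    return math.log(amount + offset)

# Hypothetical transaction amounts, including a zero-dollar one.
amounts = [0.00, 1.00, 9.99, 150.96]
log_amounts = [log_amount(a) for a in amounts]
```

The choice of base (natural vs. base-10) doesn't affect the shape of the resulting distribution, only its scale, so either works for taming skew.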

The final type of engineered feature is the "Interaction" feature.  In some cases, it can be helpful to see what effect the combination of multiple fields has.  In the binary world, this is really simple.  We can create an "AND" feature by taking [Feature1] * [Feature2].  This is what people generally mean when they say "Interactions".  Less commonly, we also see "OR" features created by taking MAX( [Feature1], [Feature2] ).  There's nothing stopping us from taking these binary concepts and applying them to continuous values as well.  For instance, let's create two-way interaction variables for ALL of the "V" features in our dataset.  Since there are 28 "V" features, there are 28 * 27 / 2 = 378 possible pairs.  This is way too many to manually write using SQL.  This is where R and Python come in handy.  Let's write a simple R script to loop through all of the features, creating the interactions.

#####################
## Import Data
#####################

dat.in <- maml.mapInputPort(1)
temp <- dat.in[,1]
dat.int <- data.frame(temp)

################################################
## Loop through all possible combinations
################################################

for(i in 1:27){
    for(j in (i+1):28){
     
#########################################
## Choose the appropriate columns
#########################################

        col.1 <- paste("V", i, sep = "")
        col.2 <- paste("V", j, sep = "")
     
        val.1 <- dat.in[,col.1]
        val.2 <- dat.in[,col.2]

#######################################
## Create the interaction column
#######################################

        dat.int[,length(dat.int) + 1] <- val.1 * val.2
        names(dat.int)[length(dat.int)] <- paste("V", i, "*V", j, sep="")
    }
}

###################
## Outputs Data
###################

dat.out <- data.frame(dat.in, dat.int[,-1])
maml.mapOutputPort("dat.out");

Interaction Features
We can see that we now have 410 columns, with all of the interaction columns coming at the end of the data set.  Now, we need to determine whether the features we added were helpful.  How do we do that?  We simply run everything through our Tune Model Hyperparameters experiment again.  Good thing we saved that.  An important note here is that Neural Networks, Support Vector Machines and Locally-Deep Support Vector Machines are all very sensitive to the size of the data.  Therefore, if we are doing large-scale feature engineering like this, we may need to exclude those models, reduce the number of random sweeps or use the faster models to perform feature selection BEFORE we create our final model.
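For completeness, the same pairwise-product loop can be sketched in Python as well.  This is a hedged equivalent of the R script above, using a small hypothetical dict of "V" columns rather than the full dataset:

```python
from itertools import combinations

# Hypothetical stand-in for the dataset's V1-V28 columns: a dict mapping
# each column name to a list of values (just 3 columns to keep it short).
data = {
    "V1": [1.0, -2.0, 0.5],
    "V2": [3.0, 1.0, -1.0],
    "V3": [0.0, 4.0, 2.0],
}

# One interaction column per unordered pair: the element-wise product.
interactions = {}
for a, b in combinations(sorted(data), 2):
    interactions[f"{a}*{b}"] = [x * y for x, y in zip(data[a], data[b])]

# 3 columns yield 3 * 2 / 2 = 3 interaction columns; with all 28 "V"
# columns, the same loop would yield 28 * 27 / 2 = 378 of them.
```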

After running the Tune Model Hyperparameters experiment on the new dataset, we've determined that the engineered features did not help our model.  This is not entirely surprising to us as the features we added were mostly for educational purposes.  When engineering features, we should generally take into account existing business knowledge about which types of features seem to be important.  Unfortunately, we have no business knowledge of this dataset.  So we ended up blindly engineering features to see if we could find something interesting.  In this case, we did not.

Hopefully, this post opened your minds to the world of feature engineering.  Did we cover all of the types of engineered features in this post?  Not by a long shot.  There are plenty of other types, such as Summarization, Temporal and Statistical features.  We encourage you to look into these on your own.  We may even have another post on them in the future.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Science Consultant
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com
