Monday, October 16, 2017

Azure Machine Learning in Practice: Threshold Selection

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation, Data Cleansing, Model Selection and Model Evaluation phases of the experiment.  In this post, we're going to walk through the threshold selection process.

So far in this experiment, we've taken the standard Azure Machine Learning evaluation metrics without much thought.  However, an important thing to note is that all of these evaluation metrics assume that the prediction should be positive when the predicted probability is greater than .5 (or 50%), and negative otherwise.  This doesn't have to be the case.
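For instance, here's a minimal sketch of what re-labeling predictions with a different threshold looks like.  It assumes the scored data set has been loaded into a data frame named dat; the column name "Scored Probabilities" matches the output of the "Score Model" module, but the .3 threshold is purely illustrative, not a recommendation.

thresh.custom <- 0.3
pred.class <- as.numeric(dat[, "Scored Probabilities"] >= thresh.custom)

Here, pred.class contains 1 for observations above the custom threshold and 0 otherwise.  The rest of this post is about choosing that threshold well.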

In order to optimize the threshold for our data, we need a data set, a model and a module that optimizes the threshold for that data set and model.  We already have a data set and a model, as we've spent the last few posts building those.  What we're missing is a module to optimize the threshold.  For this, we're going to use an Execute R Script.

Execute R Script
Execute R Script Properties
We talked about R scripts in one of our very first Azure Machine Learning posts.  This is one of the ways in which Azure Machine Learning allows us to expand its functionality.  Since Azure Machine Learning doesn't have the threshold selection capabilities we're looking for, we'll build them ourselves.  Take a look at this R Script.

CODE BEGIN

dat <- maml.mapInputPort(1)

###############################################################
## Actual Values must be 0 for negative, 1 for positive.
## String Values are not allowed.
##
## You must supply the names of the Actual Value and Predicted
## Probability columns in the name.act and name.pred variables.
##
## In order to hone in on an optimal threshold, alter the
## values for min.out and max.out.
##
## If the script takes longer than a few minutes to run and the
## results are blank, reduce the num.thresh value.
###############################################################

name.act <- "Class"
name.pred <- "Scored Probabilities"
name.value <- c("Scored Probabilities")
num.thresh <- 1000
thresh.override <- c()
num.out <- 20
min.out <- -Inf
max.out <- Inf
num.obs <- length(dat[,1])
cost.tp.base <- 0
cost.tn.base <- 0
cost.fp.base <- 0
cost.fn.base <- 0

#############################################
## Choose an Optimize By option.  Options are
## "totalcost", "precision", "recall" and
## "precisionxrecall".
#############################################

opt.by <- "precisionxrecall"

act <- dat[,name.act]
act[is.na(act)] <- 0
pred <- dat[,name.pred]
pred[is.na(pred)] <- 0
value <- -dat[,name.value]
value[is.na(value)] <- 0

#########################
## Thresholds are Defined
#########################

if( length(thresh.override) > 0 ){
    thresh <- thresh.override
    num.thresh <- length(thresh)
}else if( num.obs <= num.thresh ){
    thresh <- sort(pred)
    num.thresh <- length(thresh)
}else{
    thresh <- sort(pred)[floor(1:num.thresh * num.obs / num.thresh)]
}

#######################################
## Precision/Recall Curve is Calculated
#######################################

prec <- c()
rec <- c()
true.pos <- c()
true.neg <- c()
false.pos <- c()
false.neg <- c()
act.true <- sum(act)
cost.tp <- c()
cost.tn <- c()
cost.fp <- c()
cost.fn <- c()
cost <- c()

for(i in 1:num.thresh){
    thresh.temp <- thresh[i]
    pred.temp <- as.numeric(pred >= thresh.temp)
    true.pos.temp <- act * pred.temp
    true.pos[i] <- sum(true.pos.temp)
    true.neg.temp <- (1-act) * (1-pred.temp)
    true.neg[i] <- sum(true.neg.temp)
    false.pos.temp <- (1-act) * pred.temp
    false.pos[i] <- sum(false.pos.temp)
    false.neg.temp <- act * (1-pred.temp)
    false.neg[i] <- sum(false.neg.temp)
    pred.true <- sum(pred.temp)
    prec[i] <- true.pos[i] / pred.true
    rec[i] <- true.pos[i] / act.true
    cost.tp[i] <- cost.tp.base * true.pos[i]
    cost.tn[i] <- cost.tn.base * true.neg[i]
    cost.fp[i] <- cost.fp.base * false.pos[i]
    cost.fn[i] <- cost.fn.base * false.neg[i]
}

cost <- cost.tp + cost.tn + cost.fp + cost.fn
prec.ord <- prec[order(rec)]
rec.ord <- rec[order(rec)]

plot(rec.ord, prec.ord, type = "l", main = "Precision/Recall Curve", xlab = "Recall", ylab = "Precision")

######################################################
## Area Under the Precision/Recall Curve is Calculated
######################################################

auc <- c()

for(i in 1:(num.thresh - 1)){
    auc[i] <- prec.ord[i] * ( rec.ord[i + 1] - rec.ord[i] )
}

#################
## Data is Output
#################

thresh.out <- 1:num.thresh * as.numeric(thresh >= min.out) * as.numeric(thresh <= max.out)
num.thresh.out <- length(thresh.out[thresh.out > 0])
min.thresh.out <- min(thresh.out[thresh.out > 0])

if( opt.by == "totalcost" ){
    opt.val <- cost
}else if( opt.by == "precision" ){
    opt.val <- prec
}else if( opt.by == "recall" ){
    opt.val <- rec
}else if( opt.by == "precisionxrecall" ){
    opt.val <- prec * rec
}

ind.opt <- order(opt.val, decreasing = TRUE)[1]

ind.out <- min.thresh.out + floor(1:num.out * num.thresh.out / num.out) - 1
out <- data.frame(rev(thresh[ind.out]), rev(true.pos[ind.out]), rev(true.neg[ind.out]),
                  rev(false.pos[ind.out]), rev(false.neg[ind.out]), rev(prec[ind.out]),
                  rev(rec[ind.out]), rev(c(0, auc)[ind.out]), rev(cost.tp[ind.out]),
                  rev(cost.tn[ind.out]), rev(cost.fp[ind.out]), rev(cost.fn[ind.out]),
                  rev(cost[ind.out]), rev(c(0, cumsum(auc))[ind.out]),
                  thresh[ind.opt], prec[ind.opt], rec[ind.opt],
                  cost.tp[ind.opt], cost.tn[ind.opt], cost.fp[ind.opt], cost.fn[ind.opt],
                  cost[ind.opt])
names(out) <- c("Threshold", "True Positives", "True Negatives", "False Positives",
                "False Negatives", "Precision", "Recall", "Area Under P/R Curve",
                "True Positive Cost", "True Negative Cost", "False Positive Cost",
                "False Negative Cost", "Total Cost", "Cumulative Area Under P/R Curve",
                "Optimal Threshold", "Optimal Precision", "Optimal Recall",
                "Optimal True Positive Cost", "Optimal True Negative Cost",
                "Optimal False Positive Cost", "Optimal False Negative Cost", "Optimal Cost")

maml.mapOutputPort("out");

CODE END

We created this piece of code to help us examine the area under the Precision/Recall curve (AUC in Azure Machine Learning Studio refers to the area under the ROC curve) and to determine the optimal threshold for our data set.  It even allows us to input a custom cost function to determine how much money would have been saved and/or generated using the model.  Let's take a look at the results.
Experiment
Execute R Script Outputs
We can see that there are two different outputs from the "Execute R Script" module.  The first output, "Result Dataset", is where the data that we are trying to output is sent.  The second output, "R Device", is where any console output or graphics are sent.  Let's take a look at our results.
Execute R Script Results 1
Execute R Script Results 2
Execute R Script Graphics
We can see that this script outputs a subset of the thresholds evaluated, as well as the statistics for each of those thresholds.  It also outputs the optimal threshold, chosen according to an optimization option that we will look at shortly.  Finally, it draws a graph of the Precision/Recall Curve.  For technical reasons, this graph is an approximation of the curve, not a complete representation.  A more accurate version of this curve can be seen in the "Evaluate Model" module.  We can choose our optimization option using the following piece of code.

#############################################
## Choose an Optimize By option.  Options are
## "totalcost", "precision", "recall" and
## "precisionxrecall".
#############################################

opt.by <- "precisionxrecall"

In our case, we chose to optimize by using Precision * Recall.  Looking back at the results from our "Execute R Script", we see that our thresholds jump all the way from .000591 to .983291.  This is because the "Scored Probabilities" values output by the "Score Model" module are very heavily skewed towards zero.  In turn, this skew is caused by the fact that our "Class" variable is heavily imbalanced.
Scored Probabilities Statistics
Scored Probabilities Histogram
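If you'd like to verify this skew yourself, a small snippet like the following will summarize the "Scored Probabilities" column.  This is just a sketch; it assumes the scored data set has already been mapped into a data frame named dat, as in the script above.

pred <- dat[, "Scored Probabilities"]
print(summary(pred))
print(quantile(pred, probs = c(0.50, 0.90, 0.99)))
hist(pred, breaks = 100, main = "Scored Probabilities", xlab = "Probability")

The printed summaries land in the "R Device" output, right alongside the histogram.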
Because of the way the R code is built, it determined that the optimal threshold of .001316 has a Precision of 82.9% and a Recall of 90.6%.  These values are worse than those originally reported by the "Tune Model Hyperparameters" module.  So, we can override the thresholds in our R code by adding the following line near the top of the script:

thresh.override <- (10:90)/100

This will tell the R script to forcibly use thresholds from .10 to .90.  Let's check out the results.
Overridden Threshold Results 1
Overridden Threshold Results 2
We can see that by moving our threshold down to .42, we can squeeze slightly more value out of our model.  However, this is such a small amount of value that it's not worth the effort in this case.
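If we did want to operationalize a non-default threshold like this, one simple option is to add another "Execute R Script" after the "Score Model" module that relabels the predictions.  This is only a sketch; the "Custom Scored Labels" column name is ours, not a standard Azure Machine Learning output.

dat <- maml.mapInputPort(1)
thresh.final <- 0.42
dat[, "Custom Scored Labels"] <- as.numeric(dat[, "Scored Probabilities"] >= thresh.final)
maml.mapOutputPort("dat");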

So, when would this be useful?  As with everything, it all comes down to dollars.  We can talk to stakeholders and clients all day about how much Accuracy, Precision and Recall our models have.  In general, they aren't interested in that type of information.  However, if we can input an estimate of their cost function into this script, then we can tie real dollars to the model.  We were able to use this script to show a client that they could save $200k per year in lost product using their model.  That had far more impact than a Precision value ever would.
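For reference, that estimate goes into the cost parameters at the top of the script.  The numbers below are purely hypothetical and only show the shape of the input; because the script picks the threshold with the largest total, losses should be entered as negative numbers and savings as positive ones.

cost.tp.base <- 0        ## correctly flagged transactions: no extra cost in this illustration
cost.tn.base <- 0        ## correctly ignored transactions: no cost
cost.fp.base <- -5       ## hypothetical cost of investigating a false alarm
cost.fn.base <- -150     ## hypothetical loss from a missed fraudulent transaction

opt.by <- "totalcost"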

Hopefully, this post sparked your interest in tuning your Azure Machine Learning models to maximize their effectiveness.  We also want to emphasize that you can use R and Python to greatly extend the usefulness of Azure Machine Learning.  Stay tuned for the next post where we'll be talking about Feature Selection.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com