Model evaluation is "where the rubber meets the road", as they say. Up until now, we've been building a large list of candidate models. This is where we finally choose the one that we will use. Let's take a look at our experiment so far.
|Experiment So Far|
|Tune Model Hyperparameters|
|Sweep Results (Two-Class Boosted Decision Tree)|
Now that we have a way to generate candidate models, the next step is to choose the evaluation metric we will use to rank them. The "Tune Model Hyperparameters" module gives us a few options.
Precision is the percentage of predicted "positive" records (Class = 1 -> "Fraud" in our case) that are correct. Notice that we said PREDICTED. Precision looks at the set of records where the model thinks Fraud has occurred. This metric is calculated as
(Number of Correct Positive Predictions) / (Number of Positive Predictions)
One of the big advantages of Precision is that it doesn't care how "rare" the positive case is. That matters here, because fraud makes up only 0.13% of our records, which is extremely rare. As with any "higher is better" metric, we want Precision to be as close to 1 (or 100%) as possible.
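To make the formula concrete, here is a minimal Python sketch that computes Precision directly from a handful of invented labels (the values are purely illustrative, not from our data set):

```python
# Hypothetical labels: 1 = Fraud, 0 = Not Fraud (values invented for illustration)
actual    = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]   # 3 records where fraud actually occurred
predicted = [0, 1, 1, 0, 0, 0, 1, 1, 0, 0]   # model flags 4 records as fraud

# Correct positive predictions (true positives)
true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

# All records the model predicted as Fraud
predicted_positives = sum(predicted)

precision = true_positives / predicted_positives
print(precision)  # 2 correct out of 4 flagged -> 0.5
```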
On the other hand, Recall is the percentage of actual "positive" records that the model correctly identifies. This is slightly different from Precision in that it looks at the set of records where Fraud has actually occurred, not the set the model flagged. This metric is calculated as
(Number of Correct Positive Predictions) / (Number of Actual Positive Records)
Just as with Precision, Recall doesn't care how rare the positive case is. Also like Precision, we want this value to be as close to 1 as possible.
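Running the same toy labels through scikit-learn's built-in scorers shows how the two metrics can diverge: the model above flags four records and gets two right (Precision = 0.5), but it only catches two of the three actual fraud cases (Recall = 0.667). This is just a sketch, assuming scikit-learn is installed:

```python
from sklearn.metrics import precision_score, recall_score

# Same illustrative labels as above: 1 = Fraud, 0 = Not Fraud
actual    = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]   # 3 actual fraud records
predicted = [0, 1, 1, 0, 0, 0, 1, 1, 0, 0]   # 4 predicted fraud, 2 of them correct

# Precision: correct fraud predictions / all fraud predictions = 2/4
print(precision_score(actual, predicted))  # 0.5

# Recall: correct fraud predictions / actual fraud records = 2/3
print(recall_score(actual, predicted))     # 0.666...
```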
In our minds, Precision is a measure of how accurate your fraud predictions are, while Recall is a measure of how much fraud the model is catching. Let's look back at our evaluation metrics for the "Tune Model Hyperparameters" module.
|Model Evaluation Experiment|
|MICE - Precision - Averaged Perceptron Results|
|PPCA - Recall - LD SVM - Binning Results|
|Save Trained Model|