Some of you may be thinking "I thought this was going to be a real problem, not a fake one!" Turns out, we solved this Kaggle problem in almost exactly the same way that we've solved real customers' problems at work. The only difference here is that this data has been anonymized in order to protect everyone's privacy.
For this post, let's take a look at the data set.
|Credit Card Fraud Data 1|
|Credit Card Fraud Data 2|
Finally, let's talk about the "V1"-"V28" columns. These columns represent all of the other data we have about these customers and transactions, combined into 28 numeric features. Obviously, there were far more than 28 original features. However, in order to anonymize the data and reduce the number of features, the creator of the data set used a technique known as Principal Component Analysis (PCA). This is a well-known mathematical technique that combines a large number of original columns into a small number of new, uncorrelated columns that retain most of the variation in the data. Fortunately for the creators of this data set, it also has the advantage of anonymizing the data: each new column is a blend of many original columns, so its values no longer correspond to any single original field. While we won't dig into PCA in this post, there is an Azure Machine Learning module called Principal Component Analysis that will perform this technique for you. We may cover this module in a later post. Until then, you can read more about it here.
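If you'd like a feel for what PCA does before we get to the Azure Machine Learning module, here is a minimal sketch in Python using NumPy. To be clear, this is not how the creators of the data set produced "V1"-"V28", and it is not the Azure module. The function name `pca_transform` and the made-up data (10 correlated columns built from 3 hidden factors) are our own assumptions, purely for illustration.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project the rows of X onto the top n_components principal components.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    # Center each column so the components describe variation around the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered from most to least variance explained.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of the total variance captured by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]


# Made-up stand-in for "many original features": 200 rows and 10 correlated
# columns that are really driven by only 3 hidden factors, plus a little noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
raw = hidden @ mixing + 0.01 * rng.normal(size=(200, 10))

scores, explained = pca_transform(raw, n_components=3)
print(scores.shape)      # rows stay the same, columns shrink from 10 to 3
print(explained.sum())   # share of total variance kept by the 3 new columns
```

Each new column in `scores` is a weighted mix of all 10 original columns, which is why you can't read an original field straight off a "V" column. In this toy example the 3 new columns keep nearly all of the variation because we built the data from 3 factors; real data won't usually compress that cleanly.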
Hopefully we've piqued your interest in fraud detection with Azure Machine Learning. Feel free to hop right into the analysis and see what you can do on your own. Maybe you'll create a better model than ours! Stay tuned for the next post, where we'll be talking about cleaning up this data and preparing it for modelling. Thanks for reading. We hope you found this informative.