In this notebook, we are going to write a credit card fraud detection algorithm using a linear binary classifier, SageMaker's LinearLearner. The algorithm is pretty straightforward, but the key idea here is to discuss the trade-off between precision and recall. This problem also exemplifies how accuracy is a useless metric when there is a class imbalance in the dataset, i.e. when the majority of the labels are 0, or non-fraudulent.
Background
Labeled Data
The payment fraud dataset is provided on Kaggle by Dal Pozzolo et al. (2015). It has features and labels for hundreds of thousands of credit card transactions, each of which is labeled as fraudulent or valid.
Binary Classification
We are going to use supervised learning to produce a binary classification model. Since it is linear, the model aims to find a line (more generally, a hyperplane) that separates the valid and fraudulent transactions in the feature space. Although we could aim for a stronger model like XGBoost or a simple neural network, the objective of this notebook is to explore what SageMaker can provide for us in terms of model improvements.
Step 1 Loading and Preparing Data
import matplotlib.pyplot as plt
%matplotlib inline

import io
import os
import numpy as np
import pandas as pd
import boto3
import sagemaker

session = sagemaker.Session(default_bucket='machine-learning-case-studies')
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
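A minimal sketch of the loading step, assuming the Kaggle CSV has been downloaded to the working directory as creditcard.csv (the filename is an assumption):

# Load the Kaggle credit card fraud data into a DataFrame
transaction_df = pd.read_csv('creditcard.csv')
transaction_df.head()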
Notice that the columns are anonymized and the values normalized to protect the privacy of the sources.
Label Imbalance
counts = transaction_df['Class'].value_counts()
num_valid = counts[0]
num_fraud = counts[1]

print('Number of fraudulent labels:', num_fraud)
print('Total number of data points:', num_valid + num_fraud)
Number of fraudulent labels: 492
Total number of data points: 284807
We have a severe imbalance of labels: only about 0.17% of the data points report fraudulent credit card usage.
Split Training/Test Set
We will do a simple 70/30 train/test split.
transaction_mat = transaction_df.values  # this is a numpy array

np.random.seed(1)
np.random.shuffle(transaction_mat)

num_train = int(transaction_mat.shape[0] * 0.70)  # 70% of the data should be training

features_train = transaction_mat[:num_train, :-1]  # get everything except the last column
labels_train = transaction_mat[:num_train, -1]
features_test = transaction_mat[num_train:, :-1]  # same here
labels_test = transaction_mat[num_train:, -1]
print('Training data length:', len(features_train))
print('Test data length:', len(features_test))
print('First item: \n', features_train[0])
print('Label: ', labels_train[0])
Training data length: 199364
Test data length: 85443
First item:
[ 1.19907000e+05 -6.11711999e-01 -7.69705324e-01 -1.49759145e-01
-2.24876503e-01 2.02857736e+00 -2.01988711e+00 2.92491387e-01
-5.23020325e-01 3.58468461e-01 7.00499612e-02 -8.54022784e-01
5.47347360e-01 6.16448382e-01 -1.01785018e-01 -6.08491804e-01
-2.88559430e-01 -6.06199260e-01 -9.00745518e-01 -2.01311157e-01
-1.96039343e-01 -7.52077614e-02 4.55360454e-02 3.80739375e-01
2.34403159e-02 -2.22068576e+00 -2.01145578e-01 6.65013699e-02
2.21179560e-01 1.79000000e+00]
Label: 0.0
Step 2 Data Modeling
We will create a linear separator that separates the fraudulent data from the valid data.
Create the Model
We also have to tell SageMaker that the predictor type should be binary_classifier, because the other options are multiclass_classifier and regressor.
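A sketch of the estimator setup, using the SageMaker Python SDK v1 parameter names; the instance type, epoch count, and output path here are illustrative assumptions:

from sagemaker import LinearLearner

# Instantiate the LinearLearner estimator as a binary classifier
# (instance type, count, and epochs are illustrative choices)
linear_learner = LinearLearner(role=role,
                               train_instance_count=1,
                               train_instance_type='ml.c4.xlarge',
                               predictor_type='binary_classifier',
                               output_path='s3://{}/linear-learner'.format(bucket),
                               sagemaker_session=session,
                               epochs=15)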
# Convert numpy arrays into a RecordSet
training_data_recordset = linear_learner.record_set(train=features_train.astype('float32'),
                                                    labels=labels_train.astype('float32'))

linear_learner.fit(training_data_recordset)
2020-04-09 05:45:28 Starting - Starting the training job...
2020-04-09 05:45:29 Starting - Launching requested ML instances......
2020-04-09 05:46:29 Starting - Preparing the instances for training......
2020-04-09 05:47:50 Downloading - Downloading input data
2020-04-09 05:47:50 Training - Downloading the training image...
...
Training seconds: 170
Billable seconds: 170
Now let's deploy the model and use the resulting predictor to make a prediction. The predictor will return a list of Record protobuf messages, and the prediction is stored in the predicted_label field.
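A minimal sketch of deploying the model and reading a prediction (the endpoint instance type and the batch size are assumptions):

# Deploy the trained model to a real-time endpoint
linear_predictor = linear_learner.deploy(initial_instance_count=1,
                                         instance_type='ml.t2.medium')

# Predict on a small batch; the response is a list of Record protobuf messages
records = linear_predictor.predict(features_test[:5].astype('float32'))
print(records[0].label['predicted_label'].float32_tensor.values)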
The model has high accuracy but suffers a bit on recall and precision. To put recall and precision in perspective:
Precision = TP / (TP + FP). It measures how many of the predicted fraudulent cases are actually fraudulent. If the model has low precision, users will complain that their transactions are incorrectly labeled as fraudulent even when they are not.
Recall = TP / (TP + FN). It measures, of all the real fraudulent cases, how many the model caught. If the model has low recall, banks will complain that the model fails to catch bad actors, because fraudulent transactions go through undetected, i.e. there are many false negatives.
We can:
Obtain high precision by keeping false positives low
Obtain high recall by keeping false negatives low
In a perfect world, we would have high precision, high recall, and high accuracy. In the real world, this is not always possible; we have to trade one off against another, for example by making the model optimize for high recall so that hardly any false negatives slip through the cracks.
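The evaluations below rely on an evaluate helper that queries the endpoint and prints a confusion matrix. A minimal sketch, assuming the Record protobuf response format described earlier (the batching into 100 chunks is an assumption to keep each request under the endpoint payload limit):

def evaluate(predictor, test_features, test_labels):
    # Query the endpoint in batches and pull predicted_label
    # out of each Record protobuf message
    batches = np.array_split(test_features, 100)
    preds = np.concatenate([
        np.array([r.label['predicted_label'].float32_tensor.values[0]
                  for r in predictor.predict(batch)])
        for batch in batches])

    # Confusion matrix entries
    tp = int(np.logical_and(test_labels, preds).sum())
    fp = int(np.logical_and(1 - test_labels, preds).sum())
    tn = int(np.logical_and(1 - test_labels, 1 - preds).sum())
    fn = int(np.logical_and(test_labels, 1 - preds).sum())

    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)

    print(pd.crosstab(test_labels, preds,
                      rownames=['actual (row)'], colnames=['prediction (col)']))
    print('Recall: {:.3f}'.format(recall))
    print('Precision: {:.3f}'.format(precision))
    print('Accuracy: {:.3f}'.format(accuracy))

    return {'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
            'precision': precision, 'recall': recall, 'accuracy': accuracy}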
Optimize for Recall
Suppose the bank wants a model that catches almost all the fraudulent transactions, at the expense of annoying some users. We can easily improve recall by telling SageMaker to optimize the training for recall.
A model selection criterion of precision at target recall means SageMaker will select the model with the best precision at a target recall value. For example, if we want 90% recall, SageMaker will pick the model that has the highest precision among those achieving at least 90% recall.
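A sketch of the recall-optimized run: binary_classifier_model_selection_criteria and target_recall are the actual LinearLearner hyperparameters, while the remaining settings are assumptions carried over from the first run.

linear_recall = LinearLearner(role=role,
                              train_instance_count=1,
                              train_instance_type='ml.c4.xlarge',
                              predictor_type='binary_classifier',
                              binary_classifier_model_selection_criteria='precision_at_target_recall',
                              target_recall=0.9,  # aim for 90% recall
                              sagemaker_session=session,
                              epochs=15)

linear_recall.fit(training_data_recordset)
linear_predictor = linear_recall.deploy(initial_instance_count=1,
                                        instance_type='ml.t2.medium')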
2020-04-10 06:36:41 Starting - Starting the training job...
2020-04-10 06:36:43 Starting - Launching requested ML instances...
2020-04-10 06:37:41 Starting - Preparing the instances for training.........
2020-04-10 06:39:08 Downloading - Downloading input data
2020-04-10 06:39:08 Training - Downloading the training image...
...
2020-04-10 06:41:35 Uploading - Uploading generated training model
2020-04-10 06:41:35 Completed - Training job completed
Training seconds: 158
Billable seconds: 158
print('Evaluation for recall optimized LinearLearner model')
evaluate(linear_predictor, features_test.astype('float32'), labels_test)
There is an overwhelming number of valid transactions in the dataset, which biases the model toward false negative predictions. We can push the recall even further by telling SageMaker to increase the weight of the positive examples.
To account for class imbalance during training of a binary classifier, LinearLearner offers the hyperparameter positive_example_weight_mult: the weight assigned to positive (1, fraudulent) examples, while the weight of negative (0, valid) examples is fixed at 1. If you want the algorithm to choose a weight such that errors in classifying negative vs. positive examples have equal impact on the training loss, specify balanced. If you want the algorithm to choose the weight that optimizes performance, specify auto.
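A sketch of the same recall-optimized run with balanced positive example weights, followed by deployment (the instance types remain assumptions):

linear_balanced = LinearLearner(role=role,
                                train_instance_count=1,
                                train_instance_type='ml.c4.xlarge',
                                predictor_type='binary_classifier',
                                binary_classifier_model_selection_criteria='precision_at_target_recall',
                                target_recall=0.9,
                                positive_example_weight_mult='balanced',  # balance class weights
                                sagemaker_session=session,
                                epochs=15)

linear_balanced.fit(training_data_recordset)
linear_predictor = linear_balanced.deploy(initial_instance_count=1,
                                          instance_type='ml.t2.medium')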
2020-04-10 06:56:11 Starting - Starting the training job...
2020-04-10 06:56:12 Starting - Launching requested ML instances......
2020-04-10 06:57:40 Starting - Preparing the instances for training.........
2020-04-10 06:59:03 Downloading - Downloading input data
2020-04-10 06:59:03 Training - Downloading the training image...
...
2020-04-10 07:01:46 Completed - Training job completed
Training seconds: 178
Billable seconds: 178
-------------!
print('Evaluation for recall optimized and balanced LinearLearner model')
evaluate(linear_predictor, features_test.astype('float32'), labels_test)
Evaluation for recall optimized and balanced LinearLearner model
prediction (col) 0.0 1.0
actual (row)
0.0 84163 1139
1.0 10 131
Recall: 0.929
Precision: 0.103
Accuracy: 0.987
{'tp': 131,
'tn': 84163,
'fp': 1139,
'fn': 10,
'precision': 0.1031496062992126,
'recall': 0.9290780141843972,
'accuracy': 0.9865524384677504}
On the other hand, suppose the bank believes that customer experience is the most important and is willing to lose some money: if the model fails to detect a fraudulent transaction, the user calls the bank to claim the money back, the bank refunds the user, and the fraudster gets away with the crime. In that case, we should implement a model that optimizes for precision.
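A sketch of the precision-optimized run, selecting the model with the best recall at a target precision; recall_at_target_precision and target_precision are the actual hyperparameters, and the 90% target is an assumption:

linear_precision = LinearLearner(role=role,
                                 train_instance_count=1,
                                 train_instance_type='ml.c4.xlarge',
                                 predictor_type='binary_classifier',
                                 binary_classifier_model_selection_criteria='recall_at_target_precision',
                                 target_precision=0.9,  # aim for 90% precision
                                 positive_example_weight_mult='balanced',
                                 sagemaker_session=session,
                                 epochs=15)

linear_precision.fit(training_data_recordset)
linear_predictor = linear_precision.deploy(initial_instance_count=1,
                                           instance_type='ml.t2.medium')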
2020-04-10 20:33:48 Starting - Starting the training job...
2020-04-10 20:33:49 Starting - Launching requested ML instances......
2020-04-10 20:35:18 Starting - Preparing the instances for training.........
2020-04-10 20:36:42 Downloading - Downloading input data
2020-04-10 20:36:42 Training - Downloading the training image..
...
Training seconds: 172
Billable seconds: 172
print('Evaluation for precision optimized and balanced LinearLearner model')
evaluate(linear_predictor, features_test.astype('float32'), labels_test)
Evaluation for precision optimized and balanced LinearLearner model
prediction (col) 0.0 1.0
actual (row)
0.0 85276 26
1.0 31 110
Recall: 0.780
Precision: 0.809
Accuracy: 0.999
{'tp': 110,
'tn': 85276,
'fp': 26,
'fn': 31,
'precision': 0.8088235294117647,
'recall': 0.7801418439716312,
'accuracy': 0.9993328885923949}