Fraud Detection with Linear Learner

In this notebook, we are going to build a credit card fraud detection algorithm using a linear binary classifier. The algorithm is pretty straightforward, but the key idea here is to discuss the trade-off between precision and recall. This problem also exemplifies how accuracy is a useless metric when there is a class imbalance in the dataset, i.e. the majority of the labels are 0, or non-fraudulent.

Background

Labeled Data

The payment fraud data set (Dal Pozzolo et al. 2015) is provided through Kaggle. It has features and labels for hundreds of thousands of credit card transactions, each of which is labeled as fraudulent or valid.

Binary Classification

We are going to use supervised learning to produce a binary classification model. Since the model is linear, it aims to produce a line that separates the valid and fraudulent transactions in the feature space. Although we could aim for a stronger model like XGBoost or a simple neural network, the objective of this notebook is to explore what SageMaker can provide for us in terms of model improvements.

Step 1 Loading and Preparing Data

import matplotlib.pyplot as plt
%matplotlib inline

import io
import os
import numpy as np
import pandas as pd
import boto3
import sagemaker

# Set up the SageMaker session, execution role, and default S3 bucket.
session = sagemaker.Session(default_bucket='machine-learning-case-studies')
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

Download Data CSV File
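As a minimal sketch, assume the Kaggle CSV has already been downloaded into the notebook's working directory as creditcard.csv (the file name and location are assumptions); we load it with pandas, and calling .head() produces the preview below.

# Load the credit card fraud CSV (assumed to sit locally as creditcard.csv)
# and preview the first few rows.
local_data = 'creditcard.csv'
transaction_df = pd.read_csv(local_data)
print('Data shape (rows, cols):', transaction_df.shape)
transaction_df.head()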

|   | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|------|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|-------|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |

5 rows × 31 columns

Notice that the feature columns are anonymized (V1 through V28 are PCA-transformed components) and normalized to protect the privacy of the data sources.

Label Imbalance

We have a severe imbalance of labels: only about 0.17% of the transactions are labeled as fraudulent credit card usage.
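A quick check, assuming the transaction_df DataFrame from the loading step above:

# Count fraudulent labels (Class == 1) to quantify the imbalance.
fraud_count = transaction_df['Class'].sum()
total_count = len(transaction_df)
print('Fraudulent: {} out of {} ({:.3%})'.format(
    fraud_count, total_count, fraud_count / total_count))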

Split Training/Test Set

We will do a simple 70% training 30% test split.
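One way to sketch the split: shuffle the rows so fraud cases land in both sets, cast to float32 (the dtype Linear Learner expects), and slice off the last column (Class) as the label. The random seed is arbitrary.

# Shuffle, then split 70/30; the last column (Class) is the label.
matrix = transaction_df.sample(frac=1, random_state=1).values.astype('float32')
split_idx = int(0.7 * len(matrix))

train_features = matrix[:split_idx, :-1]
train_labels   = matrix[:split_idx, -1]
test_features  = matrix[split_idx:, :-1]
test_labels    = matrix[split_idx:, -1]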

Step 2 Data Modeling

We will create a linear separator that separates the fraudulent data from the valid data.

Linear Separator

Create the Model

We also have to tell SageMaker that it should use a binary_classifier predictor type, because the other options are multiclass_classifier and regressor.
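A sketch of the estimator, assuming the v1 SageMaker Python SDK (parameter names differ in v2); the instance type, epoch count, and S3 prefix are assumptions.

from sagemaker import LinearLearner

prefix = 'linear-learner'  # hypothetical S3 prefix for model artifacts

# Configure a Linear Learner estimator as a binary classifier.
linear = LinearLearner(role=role,
                       train_instance_count=1,
                       train_instance_type='ml.c4.xlarge',
                       predictor_type='binary_classifier',
                       output_path='s3://{}/{}/output'.format(bucket, prefix),
                       sagemaker_session=session,
                       epochs=15)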

Train it
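Linear Learner consumes RecordSet objects, so we wrap the training arrays first (continuing the sketch above):

# Wrap the numpy arrays into the protobuf RecordSet format, then train.
train_records = linear.record_set(train_features, labels=train_labels)
linear.fit(train_records)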

Step 3 Model Evaluation

Now let's use the predictor to make a prediction. The predictor will return a list of Record protobuf messages. The prediction is stored in the predicted_label field.
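A sketch of deploying an endpoint and reading predicted_label out of the returned Record messages (the endpoint instance type is an assumption):

# Deploy a real-time endpoint and predict on a handful of test points.
linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.t2.medium')

result = linear_predictor.predict(test_features[:5])
print([r.label['predicted_label'].float32_tensor.values[0] for r in result])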

We need to write a helper so we can re-use it later for model evaluation.
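Here is one possible helper, hedged as a sketch: it batches the test set (endpoints cap the request payload size), unpacks predicted_label from each Record, and derives recall, precision, and accuracy from the confusion-matrix counts.

def evaluate(predictor, test_features, test_labels, verbose=True):
    # Split the test set into batches to keep each request payload small.
    prediction_batches = [predictor.predict(batch)
                          for batch in np.array_split(test_features, 100)]
    # Extract predicted_label from every Record protobuf message.
    test_preds = np.concatenate([
        np.array([r.label['predicted_label'].float32_tensor.values[0]
                  for r in batch])
        for batch in prediction_batches])

    # Confusion-matrix counts.
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1 - test_labels, test_preds).sum()
    tn = np.logical_and(1 - test_labels, 1 - test_preds).sum()
    fn = np.logical_and(test_labels, 1 - test_preds).sum()

    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)

    if verbose:
        print('Recall:    {:.3f}'.format(recall))
        print('Precision: {:.3f}'.format(precision))
        print('Accuracy:  {:.3f}'.format(accuracy))

    return {'TP': tp, 'FP': fp, 'TN': tn, 'FN': fn,
            'Recall': recall, 'Precision': precision, 'Accuracy': accuracy}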

Now we can evaluate the model
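Using the hypothetical evaluate helper and the deployed predictor from the sketches above:

metrics = evaluate(linear_predictor, test_features, test_labels)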

Step 4 Precision vs Recall

The model has high accuracy but suffers a bit on recall and precision. To put recall and precision in perspective,

  • Precision measures how many of the predicted fraudulent cases are actually correct. If precision is low, users will complain that their transactions are incorrectly labeled as fraudulent even when they are not.

  • Recall measures how many of all the real fraudulent cases the model catches. If recall is low, banks will complain that the model fails to catch bad actors, because fraudulent transactions go through undetected, i.e. there are many false negatives.

We can

  • Obtain high precision by keeping false positives low

  • Obtain high recall by keeping false negatives low
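In confusion-matrix terms, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are the true positive, false positive, and false negative counts.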

(Figure: precision and recall)

In a perfect world, we would have high precision, high recall, and high accuracy. In the real world, however, this is not always possible. We have to make a trade-off, such as optimizing the model for high recall so that false negatives do not slip through the cracks.

Optimize for Recall

Suppose the bank wants a model that catches almost all the fraudulent transactions at the expense of annoying users. We can easily improve recall by telling SageMaker to optimize the training for recall.

Setting the model selection criteria to precision at target recall means SageMaker will select the model with the best precision at a target recall value. For example, if we want 90% recall, SageMaker will pick the model that has the highest precision among those achieving at least 90% recall.
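A sketch using Linear Learner's built-in model selection; binary_classifier_model_selection_criteria and target_recall are the relevant hyperparameters, while the remaining settings mirror the earlier estimator sketch.

# Retrain, asking SageMaker to pick the best precision at >= 90% recall.
linear_recall = LinearLearner(role=role,
                              train_instance_count=1,
                              train_instance_type='ml.c4.xlarge',
                              predictor_type='binary_classifier',
                              binary_classifier_model_selection_criteria='precision_at_target_recall',
                              target_recall=0.9,
                              output_path='s3://{}/{}/output'.format(bucket, prefix),
                              sagemaker_session=session,
                              epochs=15)
linear_recall.fit(linear_recall.record_set(train_features, labels=train_labels))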

Account for Class Imbalance

There is an overwhelming number of valid transactions in the data set, which biases the model toward false negative predictions. We can push the recall even higher by telling SageMaker to increase the weight of positive examples.

To account for class imbalance during training of a binary classifier, LinearLearner offers the hyperparameter positive_example_weight_mult, which is the weight assigned to positive (1, fraudulent) examples; the weight of negative (0, valid) examples is fixed at 1. As the SageMaker documentation puts it: if you want the algorithm to choose a weight so that errors in classifying negative vs. positive examples have equal impact on training loss, specify balanced; if you want the algorithm to choose the weight that optimizes performance, specify auto.
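Continuing the sketch, we add positive_example_weight_mult='balanced' on top of the recall-targeted settings:

# Same recall-targeted training, but with positive examples re-weighted
# so both classes contribute equally to the training loss.
linear_balanced = LinearLearner(role=role,
                                train_instance_count=1,
                                train_instance_type='ml.c4.xlarge',
                                predictor_type='binary_classifier',
                                binary_classifier_model_selection_criteria='precision_at_target_recall',
                                target_recall=0.9,
                                positive_example_weight_mult='balanced',
                                output_path='s3://{}/{}/output'.format(bucket, prefix),
                                sagemaker_session=session,
                                epochs=15)
linear_balanced.fit(linear_balanced.record_set(train_features, labels=train_labels))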

Optimize for Precision

On the other hand, suppose the bank believes that customer experience is the most important and is willing to lose some money. For example, if the model fails to detect a fraudulent transaction, the user will call the bank to claim the money back; the bank has to refund the user and let the fraudster get away with the crime. In that case, we should implement a model that optimizes for precision.
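Mirroring the recall case, one sketch is to flip the selection criteria to recall_at_target_precision (the 0.9 target is an assumption):

# Select the model with the best recall at >= 90% precision.
linear_precision = LinearLearner(role=role,
                                 train_instance_count=1,
                                 train_instance_type='ml.c4.xlarge',
                                 predictor_type='binary_classifier',
                                 binary_classifier_model_selection_criteria='recall_at_target_precision',
                                 target_precision=0.9,
                                 positive_example_weight_mult='balanced',
                                 output_path='s3://{}/{}/output'.format(bucket, prefix),
                                 sagemaker_session=session,
                                 epochs=15)
linear_precision.fit(linear_precision.record_set(train_features, labels=train_labels))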

Final Remarks

We can take a final look at what came out of the various stages of improvement.

| Model | Recall | Precision | Accuracy |
|-------|--------|-----------|----------|
| Baseline | 0.766 | 0.761 | 0.999 |
| Optimized on Recall | 0.929 | 0.047 | 0.969 |
| Balanced/Optimized on Recall | 0.929 | 0.103 | 0.987 |
| Balanced/Optimized on Precision | 0.780 | 0.809 | 0.999 |

Our linear model has its limitations; if we want higher precision, we must turn to nonlinear models.
