# Fraud Detection with Linear Learner

In this notebook, we write a credit card fraud detection model using a linear binary classifier. The algorithm is straightforward; the key idea is to discuss the trade-off between precision and recall. The problem also shows why accuracy is a misleading metric when there is a class imbalance in the dataset, i.e. the vast majority of the labels are 0 (non-fraudulent).

## Background

### Labeled Data

The payment fraud data set is provided by [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud/data) from Dal Pozzolo et al. 2015. It has features and labels for about 285,000 credit card transactions, each of which is labeled as fraudulent or valid.

### Binary Classification

We are going to use supervised learning to produce a binary classification model. Since it's linear, the model aims to produce a line (a hyperplane in higher dimensions) that separates the valid and fraudulent transactions in the feature space. Although we could aim for a better model like XGBoost or a simple neural network, the objective of this notebook is to explore what SageMaker can provide for us in terms of model improvements.

## Step 1 Loading and Preparing Data

```python
import matplotlib.pyplot as plt
%matplotlib inline

import io
import os
import numpy as np
import pandas as pd
import boto3
import sagemaker

session = sagemaker.Session(default_bucket='machine-learning-case-studies')
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
```

### Download Data CSV File

```python
!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c534768_creditcardfraud/creditcardfraud.zip
```

```
--2020-04-09 05:21:20--  https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c534768_creditcardfraud/creditcardfraud.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.236.165
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.236.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69155632 (66M) [application/zip]
Saving to: ‘creditcardfraud.zip’

creditcardfraud.zip 100%[===================>]  65.95M  11.2MB/s    in 6.7s    

2020-04-09 05:21:28 (9.90 MB/s) - ‘creditcardfraud.zip’ saved [69155632/69155632]
```

```python
!unzip creditcardfraud
```

```
Archive:  creditcardfraud.zip
inflating: creditcard.csv          
```

```python
transaction_df = pd.read_csv('creditcard.csv')
print('Data shape (rows, cols): ', transaction_df.shape)
display(transaction_df.head())
```

```
Data shape (rows, cols):  (284807, 31)
```

|   | Time | V1        | V2        | V3       | V4        | V5        | V6        | V7        | V8        | V9        | ... | V21       | V22       | V23       | V24       | V25       | V26       | V27       | V28       | Amount | Class |
| - | ---- | --------- | --------- | -------- | --------- | --------- | --------- | --------- | --------- | --------- | --- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | ------ | ----- |
| 0 | 0.0  | -1.359807 | -0.072781 | 2.536347 | 1.378155  | -0.338321 | 0.462388  | 0.239599  | 0.098698  | 0.363787  | ... | -0.018307 | 0.277838  | -0.110474 | 0.066928  | 0.128539  | -0.189115 | 0.133558  | -0.021053 | 149.62 | 0     |
| 1 | 0.0  | 1.191857  | 0.266151  | 0.166480 | 0.448154  | 0.060018  | -0.082361 | -0.078803 | 0.085102  | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288  | -0.339846 | 0.167170  | 0.125895  | -0.008983 | 0.014724  | 2.69   | 0     |
| 2 | 1.0  | -1.358354 | -1.340163 | 1.773209 | 0.379780  | -0.503198 | 1.800499  | 0.791461  | 0.247676  | -1.514654 | ... | 0.247998  | 0.771679  | 0.909412  | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0     |
| 3 | 1.0  | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203  | 0.237609  | 0.377436  | -1.387024 | ... | -0.108300 | 0.005274  | -0.190321 | -1.175575 | 0.647376  | -0.221929 | 0.062723  | 0.061458  | 123.50 | 0     |
| 4 | 2.0  | -1.158233 | 0.877737  | 1.548718 | 0.403034  | -0.407193 | 0.095921  | 0.592941  | -0.270533 | 0.817739  | ... | -0.009431 | 0.798278  | -0.137458 | 0.141267  | -0.206010 | 0.502292  | 0.219422  | 0.215153  | 69.99  | 0     |

5 rows × 31 columns

> Notice that the feature columns are anonymized (`V1` to `V28`) and normalized to protect the privacy of sources.

### Label Imbalance

```python
counts = transaction_df['Class'].value_counts()
num_valid = counts[0]
num_fraud = counts[1]

print('Number of fraudulent labels:', num_fraud)
print('Total number of data points:', num_valid + num_fraud)
```

```
Number of fraudulent labels: 492
Total number of data points: 284807
```

We have a severe imbalance of labels: only about 0.17% of the data (492 out of 284,807 transactions) reports fraudulent usage of a credit card.
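
The imbalance, and why it makes accuracy misleading, can be checked directly from the two counts printed above. The sketch below is illustrative only: it hard-codes those counts and scores a hypothetical "classifier" that labels every transaction as valid.

```python
# Counts taken from the output above.
num_fraud = 492
num_total = 284807
num_valid = num_total - num_fraud

# Fraction of transactions that are fraudulent.
fraud_rate = num_fraud / num_total
print('Fraud rate: {:.3%}'.format(fraud_rate))

# A hypothetical model that always predicts "valid" (0) gets every valid
# transaction right and every fraudulent one wrong.
always_valid_accuracy = num_valid / num_total
always_valid_recall = 0 / num_fraud  # it never catches a single fraud

print('Always-valid accuracy: {:.4f}'.format(always_valid_accuracy))
print('Always-valid recall:   {:.4f}'.format(always_valid_recall))
```

Such a model is useless for fraud detection, yet its accuracy is above 99.8%. This is why the rest of the notebook looks at precision and recall rather than accuracy.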

### Split Training/Test Set

We will do a simple split: 70% of the data for training and 30% for testing.

```python
transaction_mat = transaction_df.values
np.random.seed(1)
np.random.shuffle(transaction_mat) # Shuffle the rows in place (this is a numpy array)

num_train = int(transaction_mat.shape[0] * 0.70) # 70% of the data should be training

features_train = transaction_mat[:num_train, :-1] # Get everything except last column
labels_train = transaction_mat[:num_train, -1]

features_test = transaction_mat[num_train:, :-1] # Same here
labels_test = transaction_mat[num_train:, -1]
```

```python
print('Training data length:', len(features_train))
print('Test data length:', len(features_test))

print('First item: \n', features_train[0])
print('Label: ', labels_train[0])
```

```
Training data length: 199364
Test data length: 85443
First item: 
 [ 1.19907000e+05 -6.11711999e-01 -7.69705324e-01 -1.49759145e-01
 -2.24876503e-01  2.02857736e+00 -2.01988711e+00  2.92491387e-01
 -5.23020325e-01  3.58468461e-01  7.00499612e-02 -8.54022784e-01
  5.47347360e-01  6.16448382e-01 -1.01785018e-01 -6.08491804e-01
 -2.88559430e-01 -6.06199260e-01 -9.00745518e-01 -2.01311157e-01
 -1.96039343e-01 -7.52077614e-02  4.55360454e-02  3.80739375e-01
  2.34403159e-02 -2.22068576e+00 -2.01145578e-01  6.65013699e-02
  2.21179560e-01  1.79000000e+00]
Label:  0.0
```
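
With so few positive labels, it is worth confirming that a random split leaves fraudulent examples in both sets. The sketch below is not part of the original notebook: it rebuilds a synthetic label vector with the same class counts (the `*_demo` names are made up for this example) and applies the same shuffle-then-slice approach.

```python
import numpy as np

# Synthetic labels with the same class counts as the real data set.
labels_demo = np.zeros(284807)
labels_demo[:492] = 1

# Shuffle, then slice 70% / 30% as in the split above.
rng = np.random.RandomState(1)
rng.shuffle(labels_demo)
num_train_demo = int(len(labels_demo) * 0.70)

train_demo = labels_demo[:num_train_demo]
test_demo = labels_demo[num_train_demo:]

for name, split in [('train', train_demo), ('test', test_demo)]:
    print('{:<5} size={:<6} fraud={:<3} rate={:.4%}'.format(
        name, len(split), int(split.sum()), split.mean()))
```

On the real data, the same check is `labels_train.sum()` and `labels_test.sum()`. If the rates differ a lot between the two sets, a stratified split would be a reasonable alternative.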

## Step 2 Data Modeling

We will create a linear separator that separates the fraudulent data from the valid data.

![Linear Separator](/files/-M5_HfiYHb6R0sUReemf)
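
As a rough picture of what a linear separator does at prediction time, here is a minimal NumPy sketch. It assumes a logistic link and a 0.5 cutoff; the weights, bias and `predict_linear` helper are made up for illustration and are not taken from the trained SageMaker model.

```python
import numpy as np

def predict_linear(features, weights, bias, threshold=0.5):
    """Score points with a linear model and threshold the score into a label."""
    # Signed distance-like quantity: positive on one side of the line.
    logits = features @ weights + bias
    # Squash into (0, 1) so it can be read as a fraud score.
    scores = 1.0 / (1.0 + np.exp(-logits))
    labels = (scores >= threshold).astype('float32')
    return scores, labels

# Hypothetical 2-D model: the separating line is x1 - 2*x2 - 0.5 = 0.
demo_weights = np.array([1.0, -2.0])
demo_bias = -0.5
demo_points = np.array([[0.0, 0.0],
                        [3.0, 0.5],
                        [1.0, 2.0]])

demo_scores, demo_labels = predict_linear(demo_points, demo_weights, demo_bias)
print(demo_scores)
print(demo_labels)
```

Moving the threshold up or down shifts the decision boundary without retraining, which is the lever behind the precision and recall trade-off discussed later.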

### Create the Model

We also have to tell SageMaker that it should train a `binary_classifier`, because the other options are `multiclass_classifier` and `regressor`.

```python
s3_prefix = 'fraud_detection'
output_path = 's3://{}/{}/'.format(bucket, s3_prefix)
linear_learner = sagemaker.LinearLearner(role=role,
                                         train_instance_count=1,
                                         train_instance_type='ml.c4.xlarge',
                                         predictor_type='binary_classifier',
                                         output_path=output_path,
                                         sagemaker_session=session,
                                         epochs=20)
```

### Train it

```python
# Convert numpy arrays into RecordSet
training_data_recordset = linear_learner.record_set(train=features_train.astype('float32'),
                                                    labels=labels_train.astype('float32'))

linear_learner.fit(training_data_recordset)
```

```
2020-04-09 05:45:28 Starting - Starting the training job...
2020-04-09 05:45:29 Starting - Launching requested ML instances......
2020-04-09 05:46:29 Starting - Preparing the instances for training......
2020-04-09 05:47:50 Downloading - Downloading input data
2020-04-09 05:47:50 Training - Downloading the training image...
...
Training seconds: 170
Billable seconds: 170
```

```python
linear_predictor = linear_learner.deploy(initial_instance_count=1,
                                         instance_type='ml.t2.medium')
```

## Step 3 Model Evaluation

Now let's use the predictor to make a prediction. The predictor will return a list of `Record` protobuf messages. The prediction is stored in the `predicted_label` field.

```python
sample_input = features_test.astype('float32')
print(linear_predictor.predict(sample_input[0]))
```

```
[label {
  key: "predicted_label"
  value {
    float32_tensor {
      values: 0.0
    }
  }
}
label {
  key: "score"
  value {
    float32_tensor {
      values: 0.0017995035741478205
    }
  }
}
]
```

We need to write a helper so we can re-use it later for model evaluation.

```python
def evaluate(predictor, features_test, labels_test, test_batch_size=100, verbose=True):
    """
    Evaluate a model using a test set based on precision, recall and accuracy
    """
    # Split the test set into `test_batch_size` roughly equal batches (100 by default)
    # and collect the predictions for each batch.
    input_batches = [predictor.predict(batch) for batch in np.array_split(features_test, test_batch_size)]
    predictions = np.concatenate(
                    [
                        np.array(
                            [x.label['predicted_label'].float32_tensor.values[0] for x in batch]
                        )
                        for batch in input_batches
                    ]
                  )

    true_pos = np.logical_and(labels_test, predictions).sum()
    false_pos = np.logical_and(1-labels_test, predictions).sum()
    true_neg = np.logical_and(1-labels_test, 1-predictions).sum()
    false_neg = np.logical_and(labels_test, 1-predictions).sum()

    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    accuracy = (true_pos + true_neg) / (true_pos + false_pos + true_neg + false_neg)

    # Print a table of metrics
    if verbose:
        print(pd.crosstab(labels_test, predictions, rownames=['actual (row)'], colnames=['prediction (col)']))
        print("\n{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('Accuracy:', accuracy))
        print()

    return {
        'tp': true_pos,
        'tn': true_neg,
        'fp': false_pos,
        'fn': false_neg,
        'precision': precision,
        'recall': recall,
        'accuracy': accuracy
    }
```
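
One detail of the helper is easy to misread: despite its name, `test_batch_size` is passed to `np.array_split` as the number of batches, not the number of rows per batch. A small sketch using the test set length reported earlier (85,443 rows) shows the resulting batch sizes; `demo_rows` is a stand-in for the real feature matrix.

```python
import numpy as np

# Stand-in for the 85,443 test rows.
demo_rows = np.arange(85443)

# Same call shape as in evaluate(): the second argument is the number of sections.
demo_batches = np.array_split(demo_rows, 100)
batch_sizes = [len(batch) for batch in demo_batches]

print('Number of batches:', len(demo_batches))
print('Smallest batch:', min(batch_sizes))
print('Largest batch:', max(batch_sizes))
```

Each request to the endpoint therefore carries roughly 850 rows. If the payload ever grows too large for a single request, raising `test_batch_size` makes each batch smaller.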

Now we can evaluate the model.

```python
print('Evaluation for basic LinearLearner model')
evaluate(linear_predictor, features_test.astype('float32'), labels_test)
```

```
Evaluation for basic LinearLearner model
prediction (col)    0.0  1.0
actual (row)                
0.0               85268   34
1.0                  33  108

Recall:     0.766
Precision:  0.761
Accuracy:   0.999

{'tp': 108,
 'tn': 85268,
 'fp': 34,
 'fn': 33,
 'precision': 0.7605633802816901,
 'recall': 0.7659574468085106,
 'accuracy': 0.9992158515033414}
```

```python
session.delete_endpoint(linear_predictor.endpoint)
```

## Step 4 Precision vs Recall

The model has high accuracy but falls short on recall and precision. To put the two metrics in perspective:

* Precision measures how many of the predicted fraudulent cases are actually fraudulent. With low precision, users will complain that their legitimate transactions are incorrectly flagged as fraud.
* Recall measures how many of the real fraudulent cases the model caught. With low recall, banks will complain that the model fails to catch bad actors because fraudulent transactions went through, i.e. there are many false negatives.

We can

* Obtain high precision by keeping false positives low
* Obtain high recall by keeping false negatives low

![precision and recall](/files/-M5_HficmpRUJ4IdTgzE)
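
The definitions above reduce to simple ratios of confusion matrix counts. The sketch below recomputes the baseline metrics from the counts reported in Step 3; `classification_metrics` is a helper written for this example, not part of SageMaker.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall and accuracy from confusion matrix counts."""
    predicted_pos = tp + fp   # everything the model flagged as fraud
    actual_pos = tp + fn      # everything that really was fraud
    total = tp + fp + tn + fn
    return {
        'precision': tp / predicted_pos if predicted_pos else 0.0,
        'recall': tp / actual_pos if actual_pos else 0.0,
        'accuracy': (tp + tn) / total,
    }

# Counts from the baseline LinearLearner evaluation.
baseline = classification_metrics(tp=108, fp=34, tn=85268, fn=33)
for name, value in baseline.items():
    print('{:<11} {:.3f}'.format(name + ':', value))
```

Note that the 85,268 true negatives appear only in the accuracy formula. That is why accuracy stays near 1.0 even when a quarter of the fraud cases are missed.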

In a perfect world, we want high precision, high recall and high accuracy. In the real world, that is not always possible and we have to make a trade-off, such as optimizing the model for high recall so that as few false negatives as possible slip through the cracks.

### Optimize for Recall

Suppose the bank wants a model that catches almost all the fraudulent transactions, even at the expense of annoying users. We can improve recall by telling SageMaker to select the model based on a target recall.

> Setting the model selection criteria to `precision_at_target_recall` means SageMaker will select the model with the best precision at a target recall value. For example, if we want 90% recall, SageMaker will pick the model that has the highest precision at that recall.
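
To make the criterion concrete, here is a toy sketch of the idea: among all score thresholds that reach the target recall, keep the one with the highest precision. This illustrates the selection rule only; it is not SageMaker's internal implementation, and the scores, labels and `pick_threshold_for_recall` helper are invented for the example.

```python
import numpy as np

def pick_threshold_for_recall(scores, labels, target_recall):
    """Return (threshold, precision, recall) with the best precision
    among thresholds whose recall is at least target_recall."""
    best = None
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        tp = np.sum(predicted & (labels == 1))
        fp = np.sum(predicted & (labels == 0))
        fn = np.sum(~predicted & (labels == 1))
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        if recall >= target_recall and (best is None or precision > best[1]):
            best = (threshold, precision, recall)
    return best

# Toy validation scores with 4 fraudulent examples out of 10.
toy_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95])
toy_labels = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 1])

strict = pick_threshold_for_recall(toy_scores, toy_labels, target_recall=0.9)
relaxed = pick_threshold_for_recall(toy_scores, toy_labels, target_recall=0.5)
print('target recall 0.9:', strict)
print('target recall 0.5:', relaxed)
```

In this toy data, the stricter recall target forces a lower threshold and therefore a lower precision. The real results show the same pattern.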

```python
linear_learner = sagemaker.LinearLearner(role=role,
                                         train_instance_count=1,
                                         train_instance_type='ml.c4.xlarge',
                                         predictor_type='binary_classifier',
                                         output_path=output_path,
                                         sagemaker_session=session,
                                         epochs=20,
                                         binary_classifier_model_selection_criteria='precision_at_target_recall',
                                         target_recall=0.9) # Aim for 90% recall

linear_learner.fit(training_data_recordset)

linear_predictor = linear_learner.deploy(initial_instance_count=1,
                                         instance_type='ml.t2.medium')
```

```
2020-04-10 06:36:41 Starting - Starting the training job...
2020-04-10 06:36:43 Starting - Launching requested ML instances...
2020-04-10 06:37:41 Starting - Preparing the instances for training.........
2020-04-10 06:39:08 Downloading - Downloading input data
2020-04-10 06:39:08 Training - Downloading the training image...
...
2020-04-10 06:41:35 Uploading - Uploading generated training model
2020-04-10 06:41:35 Completed - Training job completed
Training seconds: 158
Billable seconds: 158
```

```python
print('Evaluation for recall optimized LinearLearner model')
evaluate(linear_predictor, features_test.astype('float32'), labels_test)
```

```
Evaluation for recall optimized LinearLearner model
prediction (col)    0.0   1.0
actual (row)                 
0.0               82656  2646
1.0                  10   131

Recall:     0.929
Precision:  0.047
Accuracy:   0.969

{'tp': 131,
 'tn': 82656,
 'fp': 2646,
 'fn': 10,
 'precision': 0.04717320849837955,
 'recall': 0.9290780141843972,
 'accuracy': 0.9689149491473849}
```

```python
session.delete_endpoint(linear_predictor.endpoint)
```

### Account for Class Imbalance

The data set contains an overwhelming number of valid transactions, which biases the model toward predicting the valid class and therefore toward false negatives. We can counteract this by telling SageMaker to give positive examples more weight during training.

To account for class imbalance when training a binary classifier, LinearLearner offers the hyperparameter `positive_example_weight_mult`, which is the weight assigned to positive (1, fraudulent) examples. The weight of negative examples (0, valid) is fixed at 1.

> The weight assigned to positive examples when training a binary classifier. The weight of negative examples is fixed at 1. If you want the algorithm to choose a weight so that errors in classifying negative vs. positive examples have equal impact on training loss, specify `balanced`. If you want the algorithm to choose the weight that optimizes performance, specify `auto`.
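
One way to read "equal impact on training loss" is that the positive weight equals the ratio of negative to positive examples, so both classes carry the same total weight. The sketch below works under that assumption. The exact value SageMaker computes for `balanced` is not shown in this notebook, and `weighted_log_loss` is a helper written only for this example.

```python
import numpy as np

def weighted_log_loss(labels, scores, positive_weight):
    """Weighted average log loss; negatives have weight 1."""
    weights = np.where(labels == 1, positive_weight, 1.0)
    losses = -(labels * np.log(scores) + (1 - labels) * np.log(1 - scores))
    return np.sum(weights * losses) / np.sum(weights)

# Class counts for the full data set (284,807 rows, 492 of them fraudulent).
num_valid, num_fraud = 284315, 492
balanced_weight = num_valid / num_fraud  # assumed meaning of 'balanced'
print('Balanced positive weight: {:.1f}'.format(balanced_weight))

# Toy data: 9 valid, 1 fraudulent, and a model that scores everything at 0.1.
toy_labels = np.array([0.0] * 9 + [1.0])
toy_scores = np.full(10, 0.1)

unweighted = weighted_log_loss(toy_labels, toy_scores, positive_weight=1.0)
weighted = weighted_log_loss(toy_labels, toy_scores, positive_weight=9.0)
print('Unweighted loss: {:.3f}'.format(unweighted))
print('Weighted loss:   {:.3f}'.format(weighted))
```

With the larger positive weight, a model that ignores the fraudulent example is penalized much more heavily, which pushes training away from the "always valid" solution.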

```python
linear_learner = sagemaker.LinearLearner(role=role,
                                         train_instance_count=1,
                                         train_instance_type='ml.c4.xlarge',
                                         predictor_type='binary_classifier',
                                         output_path=output_path,
                                         sagemaker_session=session,
                                         epochs=20,
                                         binary_classifier_model_selection_criteria='precision_at_target_recall',
                                         positive_example_weight_mult='balanced', # Use Balanced
                                         target_recall=0.9) # Aim for 90% recall

linear_learner.fit(training_data_recordset)

linear_predictor = linear_learner.deploy(initial_instance_count=1,
                                         instance_type='ml.t2.medium')
```

```
2020-04-10 06:56:11 Starting - Starting the training job...
2020-04-10 06:56:12 Starting - Launching requested ML instances......
2020-04-10 06:57:40 Starting - Preparing the instances for training.........
2020-04-10 06:59:03 Downloading - Downloading input data
2020-04-10 06:59:03 Training - Downloading the training image...
...
2020-04-10 07:01:46 Completed - Training job completed
Training seconds: 178
Billable seconds: 178
-------------!
```

```python
print('Evaluation for recall optimized and balanced LinearLearner model')
evaluate(linear_predictor, features_test.astype('float32'), labels_test)
```

```
Evaluation for recall optimized and balanced LinearLearner model
prediction (col)    0.0   1.0
actual (row)                 
0.0               84163  1139
1.0                  10   131

Recall:     0.929
Precision:  0.103
Accuracy:   0.987

{'tp': 131,
 'tn': 84163,
 'fp': 1139,
 'fn': 10,
 'precision': 0.1031496062992126,
 'recall': 0.9290780141843972,
 'accuracy': 0.9865524384677504}
```

```python
session.delete_endpoint(linear_predictor.endpoint)
```

### Optimize for Precision

On the other hand, suppose the bank believes that customer experience matters most. It is willing to lose some money: if the model fails to detect a fraudulent transaction, the user will call the bank to claim the money back, and the bank refunds the user while the fraudster gets away with the crime. In that case, we must implement a model that optimizes for precision.

```python
linear_learner = sagemaker.LinearLearner(role=role,
                                         train_instance_count=1,
                                         train_instance_type='ml.c4.xlarge',
                                         predictor_type='binary_classifier',
                                         output_path=output_path,
                                         sagemaker_session=session,
                                         epochs=20,
                                         binary_classifier_model_selection_criteria='recall_at_target_precision',
                                         positive_example_weight_mult='balanced', # Use Balanced
                                         target_precision=0.9) # Aim for 90% precision

linear_learner.fit(training_data_recordset)

linear_predictor = linear_learner.deploy(initial_instance_count=1,
                                         instance_type='ml.t2.medium')
```

```
2020-04-10 20:33:48 Starting - Starting the training job...
2020-04-10 20:33:49 Starting - Launching requested ML instances......
2020-04-10 20:35:18 Starting - Preparing the instances for training.........
2020-04-10 20:36:42 Downloading - Downloading input data
2020-04-10 20:36:42 Training - Downloading the training image..
...
Training seconds: 172
Billable seconds: 172
```

```python
print('Evaluation for precision optimized and balanced LinearLearner model')
evaluate(linear_predictor, features_test.astype('float32'), labels_test)
```

```
Evaluation for precision optimized and balanced LinearLearner model
prediction (col)    0.0  1.0
actual (row)                
0.0               85276   26
1.0                  31  110

Recall:     0.780
Precision:  0.809
Accuracy:   0.999

{'tp': 110,
 'tn': 85276,
 'fp': 26,
 'fn': 31,
 'precision': 0.8088235294117647,
 'recall': 0.7801418439716312,
 'accuracy': 0.9993328885923949}
```

```python
session.delete_endpoint(linear_predictor.endpoint)
```

## Final Remarks

Here is a final summary of the results from each stage of improvement.

| Model                           | Recall | Precision | Accuracy |
| ------------------------------- | ------ | --------- | -------- |
| Baseline                        | 0.766  | 0.761     | 0.999    |
| Optimized on Recall             | 0.929  | 0.047     | 0.969    |
| Balanced/Optimized on Recall    | 0.929  | 0.103     | 0.987    |
| Balanced/Optimized on Precision | 0.780  | 0.809     | 0.999    |
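
Another way to read the table is to ask how many false alarms each model raises per fraud it catches. This ratio is not computed in the original notebook; the sketch derives it from the true positive and false positive counts printed in the evaluations above.

```python
# (tp, fp) counts copied from the four evaluation outputs.
confusion_counts = {
    'Baseline': {'tp': 108, 'fp': 34},
    'Optimized on Recall': {'tp': 131, 'fp': 2646},
    'Balanced/Optimized on Recall': {'tp': 131, 'fp': 1139},
    'Balanced/Optimized on Precision': {'tp': 110, 'fp': 26},
}

# False alarms raised for every fraudulent transaction caught.
false_alarms_per_catch = {
    name: counts['fp'] / counts['tp']
    for name, counts in confusion_counts.items()
}

for name, ratio in false_alarms_per_catch.items():
    print('{:<32} {:.2f}'.format(name, ratio))
```

The recall-optimized model without balancing flags about 20 valid transactions for each fraud it catches. Balancing roughly halves that, and the precision-optimized model raises well under one false alarm per catch.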

Our linear model has its limitations. If we want higher precision without giving up recall, we will likely need to turn to nonlinear models.

