
Gradient Boosting

Background

Before we dive into gradient boosting, a couple of important concepts need to be explained first.

Evaluation Metrics

This is a quick review of metrics for evaluating machine learning models. Suppose I want to classify images into 3 categories: dog, cat, and human. There are only four possible outcomes for a classification model if we center our attention on dog classification.

  • True positive (TP): model predicted dog and it is a picture of a dog.

  • True negative (TN): model predicted not a dog, maybe a cat or a human, and indeed it is NOT a picture of a dog.

  • False positive (FP): model predicted dog but it is NOT a picture of a dog.

  • False negative (FN): model predicted not a dog but it ACTUALLY is a picture of a dog.

Sensitivity, also known as recall or true positive rate, asks: of all dog images, how many were predicted to be dogs?

\text{sensitivity}_{dog} = \frac{TP}{TP + FN}

Specificity, or true negative rate, asks: of all non-dog images, how many were predicted to be NOT dogs?

\text{specificity}_{dog} = \frac{TN}{TN + FP}

Precision, or positive predictive value, asks: of all the predicted dog labels, how many of the input images are actually dogs?

\text{precision}_{dog} = \frac{TP}{TP + FP}

Precision and recall are the two most common metrics. I personally have not seen people use specificity much; people tend to only care about producing the correct label. Specificity may be useful for fraud and disease detection. However, for those cases, it's often better to have a false positive than a false negative, because a false negative could mean death to an ill person.
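As a quick illustration, here is how the three metrics fall out of the four counts above; the specific numbers are made up for this example.

tp, tn, fp, fn = 40, 45, 5, 10   # hypothetical counts for the dog class

sensitivity = tp / (tp + fn)     # recall / true positive rate
specificity = tn / (tn + fp)     # true negative rate
precision = tp / (tp + fp)       # positive predictive value

print(sensitivity, specificity, precision)   # 0.8 0.9 0.888...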

Bias vs Variance

The inability of a machine learning method to capture the true relationship is called bias. The difference in fits between data sets is called variance. High variance means the fits vary greatly between different data sets, which is often a result of overfitting to one data set. The perfect model would have low bias, because it fits the data well by capturing the true relationship, and low variance, because it produces consistent predictions across different data sets.

This is done by finding the sweet spot between a simple model and a complicated model. There are three commonly used methods for finding that sweet spot.

  • Regularization (to prevent overfitting as explained in my neural network sections)

  • Boosting

  • Bagging

Random Forest

Decision trees are easy to build, easy to use, and easy to interpret but in practice they are not that useful.

To quote The Elements of Statistical Learning, "Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy."

In other words, they work great with the data used to create them, but they are not flexible when it comes to classifying new samples. Random forest solves this problem by combining the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.

Bagging

Start off with bootstrapping: to create a bootstrapped dataset that is the same size as the original, we randomly select samples from the original dataset, and we are allowed to pick the same sample more than once. This is known as random sampling with replacement. Then we create a decision tree using the bootstrapped dataset, but only use a subset of variables or columns at each step. How many columns we consider at each node-splitting step is a hyperparameter that requires tuning.

In summary,

  1. We build a tree using a bootstrapped dataset.

  2. Only consider a random subset of variables at each step.

  3. Repeat 1 & 2 until we have X trees and we have a forest.

  4. Random forest predicts by aggregating the results of all trees, taking the majority vote.

  5. We take the out-of-bag data (data that didn't go into creating the tree) to evaluate the accuracy of the forest.

  6. Tune the hyperparameters by repeating steps 1 to 5.

Why is it called bagging? Because it bootstraps and aggregates.
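As a minimal sketch of these steps using scikit-learn (the synthetic dataset here is just a stand-in, not part of the original example), RandomForestClassifier handles the bootstrapping, the per-split column subset (max_features), and the out-of-bag evaluation for us.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# bootstrap=True resamples rows with replacement, max_features limits the
# columns considered at each split, and oob_score=True evaluates each tree
# on the out-of-bag samples it never saw during training.
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                            bootstrap=True, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)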

AdaBoost

AdaBoost is short for adaptive boosting. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one.

F(x) = \text{sign}\left(\sum^{M}_{m=1} \theta_{m} f_{m}(x)\right)

Here f_m is a weak learner and \theta_m is the corresponding weight.

In the forest of trees made with AdaBoost, the trees are usually just a node with two leaves. We call this a stump. Stumps are not great at making accurate classifications because a stump can only use one variable to make a decision. The trees of a regular random forest vote with equal weight; in contrast, some stumps of an AdaBoosted forest get more say in the final classification than others.

Order matters when we construct stumps for an AdaBoosted forest. The errors that the first stump makes influence how the second stump is made.

Key ideas,

  1. AdaBoost combines a lot of weaker learners to make classifications. The weak learners are almost always stumps.

  2. Some stumps get more say in the classification than others.

  3. Each stump is made by taking the previous stump's mistakes into account.

In order to take the previous stump's mistakes into account, we assign each sample (data point) a weight initially. Each time a stump misclassifies samples, their weights are adjusted accordingly.

Since the focus of this writeup isn't on AdaBoost, I will skip the mathematical details for now and re-add them later.

When we make the second stump, if we have a weighted Gini function, we use it with the sample weights; otherwise we use the sample weights to create a new dataset that reflects those weights.
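For completeness, a minimal scikit-learn sketch (again on a synthetic stand-in dataset); by default AdaBoostClassifier uses depth-1 decision stumps as the weak learners and reweights samples after each stump.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The default weak learner is a decision stump (a tree with max_depth=1);
# each stump's vote is weighted by how well it performed on the reweighted data.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))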

Gradient Boost

Now we can talk about gradient boost. Suppose we are given a dataset and want to predict a person's weight.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [[1.6, 'Blue', 'Male', 88],
     [1.6, 'Green', 'Female', 76],
     [1.5, 'Blue', 'Female', 56],
     [1.8, 'Red', 'Male', 73],
     [1.5, 'Green', 'Male', 77],
     [1.4, 'Blue', 'Female', 57]], columns=['height', 'favorite_color', 'gender', 'weight'])

display(df)
   height favorite_color  gender  weight
0     1.6           Blue    Male      88
1     1.6          Green  Female      76
2     1.5           Blue  Female      56
3     1.8            Red    Male      73
4     1.5          Green    Male      77
5     1.4           Blue  Female      57

Build a Tree

The first tree we build assumes that the predicted weight for everyone is the mean weight of the dataset. You can imagine that this tree only has a leaf node, and that leaf node always returns the following value regardless of inputs.

df['weight'].mean()
71.16666666666667

The second tree we build will be based on the errors that the previous tree made. We first need to compute the errors and save them as a new column named pseudo_residual_1. The term pseudo residual comes from linear regression, where the difference between the observed values and the predicted values gives the residuals.

df['pseudo_residual_1'] = np.round(df['weight'] - df['weight'].mean(), decimals=1)
display(df)
   height favorite_color  gender  weight  pseudo_residual_1
0     1.6           Blue    Male      88               16.8
1     1.6          Green  Female      76                4.8
2     1.5           Blue  Female      56              -15.2
3     1.8            Red    Male      73                1.8
4     1.5          Green    Male      77                5.8
5     1.4           Blue  Female      57              -14.2

Now we use height, favorite color, and gender to predict the residuals. It may seem strange to predict the residuals instead of the original weights; the answer will become clear soon. For this example, we will limit the maximum number of leaves to 4. It's common to use 8 to 32 for larger datasets.

Some leaves have multiple values and we need to average them. Now we have two trees that look like this.

To produce a prediction for [1.6, 'Blue', 'Male'], we feed the input to the first tree and obtain 71.2 as the output. We feed the data into the second tree and obtain 16.8 as the second output. We sum the two values.

71.2 + 16.8 = 88

Looks like it is working! But it should be apparent that this tree is way overfitted. It has low bias and very high variance.
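Here is a minimal sketch of how the second tree above could be fit with scikit-learn, assuming the df defined earlier; the one-hot encoding and the 4-leaf limit are choices made for this illustration.

from sklearn.tree import DecisionTreeRegressor

# One-hot encode favorite_color and gender so the tree can split on them.
X = pd.get_dummies(df[['height', 'favorite_color', 'gender']])

tree_1 = df['weight'].mean()                           # the single-leaf first tree
tree_2 = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)
tree_2.fit(X, df['pseudo_residual_1'])

# Prediction for the first row [1.6, 'Blue', 'Male']: mean + predicted residual.
print(tree_1 + tree_2.predict(X.iloc[[0]])[0])         # close to the observed 88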

Build More Trees Incrementally

Gradient Boost deals with the variance problem by using a learning rate to scale the contribution from each new tree. The learning rate is usually a value between 0 and 1; I will use α to denote it. Suppose we use learning_rate=0.1, then the contribution from the second tree becomes the following.

71.2 + \alpha (16.8) = 71.2 + (0.1)(16.8) = 72.88

Instead of making a big jump, we take an incremental approach and nudge the predicted value toward the desired place. So far we haven't talked about the gradient part yet.

Empirical evidence shows that taking lots of small steps in the right direction results in better predictions on test data, i.e. lower variance.

Let's build the third tree using the same algorithm we used to build the second tree. Pretend that we have the tree and we feed our data frame into it; here are the new residual values.

df.insert(5, 'pseudo_residual_2', [15.1, 4.3, -13.7, 1.4, 5.4, -12.7])
display(df)
   height favorite_color  gender  weight  pseudo_residual_1  pseudo_residual_2
0     1.6           Blue    Male      88               16.8               15.1
1     1.6          Green  Female      76                4.8                4.3
2     1.5           Blue  Female      56              -15.2              -13.7
3     1.8            Red    Male      73                1.8                1.4
4     1.5          Green    Male      77                5.8                5.4
5     1.4           Blue  Female      57              -14.2              -12.7

Now we build the third tree using the same averaging technique as before. We have 3 trees!

We will repeat this process many times until we reach the maximum number of trees or until adding additional trees does not significantly reduce the size of the residuals. The final model can be summarized with the following equation.

\text{prediction} = \text{tree}_1(X) + \alpha \sum^{M}_{m=2} \text{tree}_m(X)
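The whole procedure can also be written as a short from-scratch loop. This is only a sketch under the assumptions used above (squared-error loss, one-hot encoded features, trees with at most 4 leaves); the helper names are mine.

from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, learning_rate=0.1, max_leaf_nodes=4):
    """Fit an initial constant prediction, then repeatedly fit trees to the residuals."""
    f0 = y.mean()                                     # tree 1: a single leaf
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                    # pseudo residuals for L2 loss
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)   # take a small step
        trees.append(tree)
    return f0, trees

def boosted_predict(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)

X = pd.get_dummies(df[['height', 'favorite_color', 'gender']])
f0, trees = gradient_boost(X, df['weight'].astype(float))
print(boosted_predict(f0, trees, X))                  # very close to the observed weights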

Mathematical Formulation

This is the mathematical interpretation of what we did above. We have a data set with N rows, denoted as follows.

\{(X_i, y_i)\}^{N}_{i=1}

We have a loss function to evaluate how well the tree model predicts. We will use the L2 loss, which is just the squared residual. This loss function is differentiable.

L(y_i, F(X_i)) = (y_i - y_{predicted})^2

Step 1 Solve for First Term

We should imagine that we are trying to build a predictive model F using a series of first order approximations, or you can simply argue it is doing a first order Taylor series approximation. Let's start off with the first term. We agree that F_0 is a term we want to compute, but we don't know what value it should be. Let gamma \gamma denote this value; we then have an optimization task to perform.

F_0 = \text{argmin}_{\gamma}\; \sum^{N}_{i=1} L(y_i, \gamma)

We take the derivative of this expression with respect to gamma and find the optimal value of gamma that minimizes it, by setting the derivative to 0 and solving for gamma. It shouldn't be too hard.

0 = \sum^{N}_{i=1} 2(y_i - \gamma) \quad\Longrightarrow\quad \sum^{N}_{i=1} y_i - N\gamma = 0

Therefore,

F_0 = \gamma = \frac{\sum^{N}_{i=1} y_i}{N}

What does this look like? It's the mean of the observed values!
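A quick numeric sanity check of this result, using the df from earlier: scanning candidate values of gamma, the sum of squared residuals is smallest near the mean weight.

candidates = np.linspace(60, 80, 201)
losses = [((df['weight'] - gamma) ** 2).sum() for gamma in candidates]
print(candidates[np.argmin(losses)])   # about 71.2, essentially the mean weight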

Step 2 Solve for Second, Third, etc... Terms

Step 2a

Now we know why we used the mean for the first tree. We can proceed to solve for the second, third, ..., and Mth tree. We first need to compute the residual value for the ith sample and the mth tree.

r_{i,m} = -\left[\frac{\partial L(y_i, F(X_i))}{\partial F(X_i)}\right]_{F(X) = F_{m-1}(X)} \quad \text{for } i = 1, 2, \ldots, N

This looks scary, but it's actually just the derivative of the loss function with respect to the predicted value, which comes from step 1. The predicted value from the previous tree is \gamma. Therefore, using the chain rule, we can simplify it to the following.

r_{i,m} = 2\,(y_i - \gamma)

Step 2b

Now we need to fit a regression tree to the r_{i,m} values and create terminal regions R_{j,m} for j = 1 to J_m. Here J_m is simply the number of leaf nodes for the regression tree.

Step 2c

For each terminal region, or leaf node, we need to compute the \gamma value again. However, there's a small difference here: the second tree has multiple leaf nodes, and we need to compute \gamma for all of them.

\text{for } j = 1, 2, \ldots, J_m \;\text{compute}\; \gamma_{j,m} = \text{argmin}_{\gamma} \sum L(y_i, F_{m-1}(x_i) + \gamma)

The minimization is the same as what we did in Step 1. One small difference is that now we take the previous prediction into account. This summation is also picky about which samples it includes; before, the summation included all samples, while the new summation only sums over the samples that belong to the leaf node. At this point we should use the actual loss function to simplify the math, i.e.

L = \frac{1}{2} (y_{observed} - y_{predicted})^2

I will skip the chain rule and derivatives and jump to the conclusion. We will discover that the new \gamma is equal to the mean difference between the observed values and the previous tree's predicted values, taken over the samples that belong to this node.

\gamma_{j,m} = \frac{1}{n_{j,m}} \sum_{x_i \in R_{j,m}} \left( y_i - F_{m-1}(x_i) \right)

where n_{j,m} is the number of samples in leaf R_{j,m}.

Step 2d

Finally, we update the original model.

F_m(X) = F_{m-1}(X) + \alpha \sum^{J_m}_{j=1} \gamma_{j,m} \;\text{for the leaves that } X \text{ belongs to}

This is basically saying that the new prediction using 2 trees is the prediction made by the first tree plus the learning-rate-scaled prediction made by the second tree. Now we repeat this step until m ≈ 100 or even more, depending on the complexity of the data we are trying to model.

Step 3

The final step: the output of the model is F_{100} if we set M = 100. There we have it, gradient boosted trees. Notice the similarity between Gradient Boost and AdaBoost? Gradient Boost grows trees that are multiple layers deep, depending on how many max leaf nodes we set for the model, and uses the gradient, or derivative, to make incremental directional improvements, while AdaBoost uses weights and resampling to guide the search.
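For reference, a minimal sketch of the library equivalent using scikit-learn's GradientBoostingRegressor on the df from above; it follows the same recipe of an initial constant prediction plus learning-rate-scaled trees fit to the negative gradients.

from sklearn.ensemble import GradientBoostingRegressor

X = pd.get_dummies(df[['height', 'favorite_color', 'gender']])
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_leaf_nodes=4, random_state=0)
gbr.fit(X, df['weight'])
print(gbr.predict(X.iloc[[0]]))   # close to the observed 88 on the training data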

For Classification Problems

So far we've only covered gradient boosting for regression problems. Using Gradient Boost for classification is similar to logistic regression: we will be working with log odds and probabilities.

Suppose we are performing a binary classification, answering a yes-or-no question. The following is the list of modifications needed to make Gradient Boost work for this problem.

Initial Predictions

Instead of using the mean of all samples, the initial prediction should be the log of the odds. The odds are the ratio of yes samples to no samples. Suppose we have 6 people, we ask whether they like Toy Story, and we have:

user_responses = [True, True, False, False, True, True]

The odds would be 4 to 2, or 2. As a side note, there's another name for the log of the odds: it's called the logit function.

\text{odds} = \frac{N_{true}}{N_{false}}

Therefore, the initial prediction, or initial leaf node, is

\text{log(odds)} = \log\left(\frac{4}{2}\right) = 0.693

Conversion to Probability

We cannot use the output of the logit function as the output of the leaf directly. We need to convert it to a probability.

P = \frac{e^{\text{log(odds)}}}{1 + e^{\text{log(odds)}}}

For the example above, the exponential function cancels with the log function and we are left with a simple calculation.

P = \frac{2}{1 + 2} = 0.667
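A quick numeric check of the two formulas above, reusing the user_responses list from earlier.

responses = np.array(user_responses)
log_odds = np.log(responses.sum() / (~responses).sum())   # log(4/2), about 0.693
probability = np.exp(log_odds) / (1 + np.exp(log_odds))   # about 0.667
print(log_odds, probability)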

Residual Calculation

The residual calculation here is a bit different from the regression example because we are dealing with probabilities. If a sample says yes, he/she likes Toy Story, the value for True is 1 and the value for False is 0. Our first leaf node will always predict 0.667 for the likelihood of liking Toy Story. For a sample that likes Toy Story, the residual is calculated as follows.

\text{Residual} = P_{observed} - P_{predicted} = 1 - 0.667 = 0.333

Subsequent Tree Terminal Region Outputs

The first tree has only one leaf node, so the output is the same for all samples. The second tree will have more leaf nodes and decision branching. When we used Gradient Boost for regression, a leaf with a single residual had an output value equal to that residual. If the leaf had more than one residual value, the output value would equal the mean of all residual values in that leaf.

The situation is different for probabilities. We need a transformation, and the output value would be the following.

\text{logit} = \frac{\sum_{i} \text{Residual}_i}{\sum_{i} \text{Previous Probability}_i \cdot (1 - \text{Previous Probability}_i)}

The previous probability refers to the previous output, which comes from the initial leaf of tree 1. Also notice that the final output of each tree is a logit, or log of the odds. Suppose we have 100 trees; we would sum up the contributions of all trees as logits and convert the final sum to a probability value.
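A small numeric example of this formula, for a hypothetical leaf containing one "yes" sample and one "no" sample, both of which started from the 0.667 initial probability.

residuals = np.array([1 - 0.667, 0 - 0.667])              # one yes, one no
previous_prob = np.array([0.667, 0.667])
leaf_logit = residuals.sum() / (previous_prob * (1 - previous_prob)).sum()
print(leaf_logit)                                         # the leaf's output, on the log-odds scale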
