Random Forest

Random forest is a popular ensemble machine learning technique. Essentially, it combines a collection of decision trees with bootstrap aggregation (bagging) to reduce variance. A single, fully grown decision tree tends to have low bias but high variance, i.e. it overfits the training data. Averaging many trees trained on bootstrapped samples keeps the bias low, while the bagging technique addresses the variance problem.

We can build a decision tree easily using sklearn and achieve over 80% accuracy on the MNIST dataset using the raw pixel values as features.

from sklearn import tree
from keras.datasets import mnist

# Load MNIST: 60,000 training and 10,000 test images of shape 28x28
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each image into a 784-dimensional vector and scale pixel values to [0, 1]
N, H, W = x_train.shape
x = x_train.reshape((N, H * W)).astype('float32') / 255

# Fit a single decision tree on the integer class labels;
# sklearn does not need one-hot encoded targets
model = tree.DecisionTreeClassifier()
model.fit(x, y_train)

# Evaluate accuracy on the held-out test set
N, H, W = x_test.shape
x = x_test.reshape((N, H * W)).astype('float32') / 255
print(model.score(x, y_test))

A random forest with 100 trees outperforms a single decision tree on this task, although it obviously takes longer to train.
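For comparison, here is a minimal sketch using sklearn's RandomForestClassifier. It reuses the x_train, y_train, x_test, and y_test arrays loaded above; the n_jobs=-1 setting is just my choice to parallelise training.

from sklearn.ensemble import RandomForestClassifier

# A forest of 100 trees trained on the same flattened, scaled pixel features
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)
forest.fit(x_train.reshape((len(x_train), -1)).astype('float32') / 255, y_train)
print(forest.score(x_test.reshape((len(x_test), -1)).astype('float32') / 255, y_test))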

Decision Tree

A decision tree is very intuitive. It is a direct representation of how we make decisions on a day-to-day basis. Here I borrow an example from Christopher Gray. Suppose that I am deciding whether to accept a job offer.

(Figure: the job offer decision tree)

I ask myself a series of questions, e.g. is the salary above $50,000? If not, I can immediately decline the offer. If so, I ask myself another question: does the commute take longer than 1 hour? If it does, I immediately decline the offer. If not, I ask another question, and the process goes on until there are no more questions to ask.

Tree Node

There are two types of nodes in a decision tree.

Prediction node(s)

These are the leaf nodes of the tree. They hold the final answer to the ultimate question, i.e. should I accept this job offer?

Decision node(s)

These are the parent nodes of the tree. Each represents a question or rule that splits the data into two different outcomes/branches.
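In code, the two node types can be represented by something like the following sketch. The class names Leaf and DecisionNode are my own illustration, not necessarily the names used in the actual implementation.

from collections import Counter

class Leaf:
    """Prediction node: stores the class counts of the training rows that reach it."""
    def __init__(self, rows):
        self.predictions = Counter(row[-1] for row in rows)

class DecisionNode:
    """Decision node: a splitting rule plus the two branches it produces."""
    def __init__(self, rule, true_branch, false_branch):
        self.rule = rule
        self.true_branch = true_branch
        self.false_branch = false_branch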

How to Split

Now the question is, how do we know when to branch at each node and when to stop branching? Also, how do we choose the question to ask for each split? The essence of splitting comes down to reducing impurity. In the training phase, we are given a set of data.

For example, let's say we have a basket of fruits, which contains apples, grapes, and lemons. Each of them has a set of attributes or features, like color and diameter.

| Color  | Diameter (cm) | Fruit |
| ------ | ------------- | ----- |
| Green  | 3             | Apple |
| Red    | 3             | Apple |
| Green  | 1             | Grape |
| Purple | 1             | Grape |
| Yellow | 3             | Lemon |

At every decision node, we ask a question that splits the data into two batches. For example, we may ask, is the diameter greater than 1 cm? All the fruits with a diameter greater than 1 cm are grouped together (batch 0), and the rest are put in another group (batch 1).

| Color  | Diameter (cm) | Fruit | Batch |
| ------ | ------------- | ----- | ----- |
| Green  | 3             | Apple | 0     |
| Red    | 3             | Apple | 0     |
| Green  | 1             | Grape | 1     |
| Purple | 1             | Grape | 1     |
| Yellow | 3             | Lemon | 0     |

Now the split is clear. We have 2 apples and 1 lemon in one group, and both grapes in the other. We keep splitting until the impurity is zero, i.e. the data in each group are homogeneous, unless there is a maximum-depth restriction. In this case, the grape group already has an impurity of zero because everything in that group is a grape.

CART Gini Index

A population is pure if, when we select two items from it, the probability that they belong to the same class is one. The Gini index measures how impure a population is, ranging from 0 (pure) towards 1 (impure).

G = 1 - \sum_{i=1}^{C} P(i)^{2}

The procedure to benchmark a branching decision is as follows.

  1. Calculate the Gini index for the left and right sub-nodes.

  2. Take the weighted average of the two indices (weighted by the number of samples in each sub-node) to obtain the impurity of the split, as shown in the sketch below.
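To make this concrete, here is a minimal sketch of the Gini calculation and the weighted impurity of a split, using the fruit example above. The function names gini and split_impurity are illustrative, not taken from the actual implementation.

from collections import Counter

def gini(rows):
    """Gini impurity of a batch of rows; the class label is the last column."""
    counts = Counter(row[-1] for row in rows)
    total = float(len(rows))
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_impurity(left, right):
    """Weighted average of the Gini impurity of two sub-nodes."""
    total = len(left) + len(right)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)

# Splitting the fruit basket on "diameter greater than 1 cm"
left = [['Green', 3, 'Apple'], ['Red', 3, 'Apple'], ['Yellow', 3, 'Lemon']]
right = [['Green', 1, 'Grape'], ['Purple', 1, 'Grape']]
print(gini(left + right))           # 0.64, impurity before the split
print(split_impurity(left, right))  # ~0.27, weighted impurity after the split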

ID3

One of the core algorithms for building decision trees is ID3. It employs a top-down, greedy search through the space of possible branches with no backtracking, and uses entropy and information gain to construct the tree.

Entropy is defined as follows.

E = \sum_{i=1}^{C} -P(i) \log_{2} P(i)

The procedure to benchmark a branching decision is as follows.

  1. Calculate the entropy before the split happens.

  2. Calculate the entropy of the left and right sub-branches.

  3. Subtract the weighted sum of the sub-entropies from the prior entropy to obtain the information gain.

IG = E_{0} - \sum_{i=1}^{2} P(i) E_{i}
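As a sketch (again with illustrative function names rather than the repository's actual code), entropy and information gain for a binary split can be computed like this, where P(i) is the fraction of samples that fall into sub-branch i.

import math
from collections import Counter

def entropy(rows):
    """Entropy of the class labels stored in the last column of each row."""
    counts = Counter(row[-1] for row in rows)
    total = float(len(rows))
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two sub-branches."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return entropy(parent) - (p_left * entropy(left) + p_right * entropy(right))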

Implementation Highlights

The source code can be found on GitHub via the link in the top navigation bar.

I am not going to dive into the fine details of how the class is implemented because the code is there on GitHub. However, there are a couple of highlights.

Partition

There is a function called partition. It accepts a batch of data and a rule as arguments. The rule acts as a question, e.g. "Is it green?". If a data point answers yes to the question, it is partitioned into the true set; otherwise it goes into the false set.
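Here is a minimal sketch of what such a function could look like, assuming the rule object exposes a match(row) method (sketched in the next section); this is my illustration rather than the repository's code.

def partition(rows, rule):
    """Split rows into a true set and a false set according to the rule."""
    true_rows, false_rows = [], []
    for row in rows:
        if rule.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows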

Split Rule

As mentioned above, a rule is a question. It operates on two types of values, i.e. numeric and string. For a string-typed attribute, the rule or question asks whether the input is equal to a string value; for example, in Is it green?, the string "green" is the value we are matching against. For a numeric-typed attribute, the rule or question asks whether the input is greater than a numeric value, for example, Is the diameter greater than 3 cm?
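A sketch of such a rule, with Rule and match as illustrative names of my own:

class Rule:
    """Asks "is column equal to value?" for strings, "is column greater than value?" for numbers."""
    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, row):
        candidate = row[self.column]
        if isinstance(self.value, (int, float)):
            return candidate > self.value
        return candidate == self.value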

Rule Selection

So how do we decide which rule/question to use at each decision node? We use the Gini index or information gain. We iterate through each attribute (and each candidate value of that attribute), partition the data on it, compute the purity gain or information gain of each resulting partition, and select the partition that gives the highest gain.
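Building on the earlier sketches (gini, split_impurity, partition, and Rule), rule selection could look roughly like this; find_best_rule is my own name for it.

def find_best_rule(rows):
    """Try every (column, value) rule and keep the one with the highest purity gain."""
    best_gain, best_rule = 0.0, None
    current_impurity = gini(rows)
    n_features = len(rows[0]) - 1              # the last column is the class label
    for column in range(n_features):
        for value in set(row[column] for row in rows):
            rule = Rule(column, value)
            true_rows, false_rows = partition(rows, rule)
            if not true_rows or not false_rows:
                continue                       # this rule did not actually split the data
            gain = current_impurity - split_impurity(true_rows, false_rows)
            if gain > best_gain:
                best_gain, best_rule = gain, rule
    return best_gain, best_rule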

Forest of Decision Trees

As the name implies, a random forest is simply a collection of randomly generated decision trees.

Pseudo-code

Let's assume the dataset contains N examples and K features, so the input is essentially of shape (N, K).

  1. Randomly select k features from the K total features, where k << K.

  2. Construct a decision tree using the k features and give it thresholds for splitting.

    • Example threshold: minimum impurity decrease, a.k.a. minimum purity gain.

    • Example threshold: maximum depth, i.e. stop splitting once the tree has reached the maximum depth.

    • Example threshold: minimum sample split, i.e. the minimum number of samples required to split a decision node.

    • Optional argument: whether or not to bootstrap the training samples.

  3. Build a forest by repeating Step 2 n_estimators times.

  4. Prediction is made through a majority vote across the trees, as sketched after this list.
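Building on the sketches above (Leaf, DecisionNode, find_best_rule, and partition), the whole procedure might look roughly like the following. For brevity this sketch bootstraps the samples but skips the per-tree feature subsampling of Step 1, and the function names are again my own.

import random
from collections import Counter

def build_tree(rows, max_depth=10, min_gain=1e-7):
    """Step 2: recursively split until there is no purity gain or the maximum depth is reached."""
    gain, rule = find_best_rule(rows)
    if gain <= min_gain or max_depth == 0:
        return Leaf(rows)
    true_rows, false_rows = partition(rows, rule)
    return DecisionNode(rule,
                        build_tree(true_rows, max_depth - 1, min_gain),
                        build_tree(false_rows, max_depth - 1, min_gain))

def build_forest(rows, n_estimators=100):
    """Step 3: grow each tree on a bootstrap sample of the data."""
    forest = []
    for _ in range(n_estimators):
        sample = [random.choice(rows) for _ in range(len(rows))]
        forest.append(build_tree(sample))
    return forest

def classify(row, node):
    """Walk a single tree until a leaf is reached and return its most common class."""
    if isinstance(node, Leaf):
        return node.predictions.most_common(1)[0][0]
    branch = node.true_branch if node.rule.match(row) else node.false_branch
    return classify(row, branch)

def predict(forest, row):
    """Step 4: majority vote across the trees."""
    votes = Counter(classify(row, tree) for tree in forest)
    return votes.most_common(1)[0][0]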

Put Everything Together

First, create a helper function for loading iris species data from CSV.

Then load the data and shuffle it, because the original data set is sorted by iris species. We then do an 80:20 split on the data set to get a training set and a validation set.
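A sketch of that setup follows; the file name iris.csv and the column layout (four numeric features followed by the species label, with no header row) are assumptions on my part.

import csv
import random

def load_iris_csv(path):
    """Read rows of the form sepal_length, sepal_width, petal_length, petal_width, species."""
    rows = []
    with open(path) as f:
        for record in csv.reader(f):
            if not record:
                continue                      # skip blank lines
            rows.append([float(v) for v in record[:-1]] + [record[-1]])
    return rows

rows = load_iris_csv('iris.csv')
random.shuffle(rows)                          # the original file is sorted by species
split = int(len(rows) * 0.8)                  # 80:20 train/validation split
train_rows, valid_rows = rows[:split], rows[split:]

The training rows can then be fed to build_forest, and the validation rows used to measure accuracy with predict.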
