
KL Divergence



We often want to quantify the difference between two probability distributions over the same random variable. For example, we might be interested in how far an observed (empirical) distribution is from the true underlying distribution.

Kullback-Leibler (KL) divergence provides such a score: it measures how much one probability distribution diverges from another.

Definition

The KL divergence between two distributions P and Q is written as follows.

$$\text{KL}(P \| Q)$$

This notation denotes the divergence of P from Q, which is in general not the same as the divergence of Q from P. KL divergence is therefore not symmetric.

$$\text{KL}(P \| Q) \neq \text{KL}(Q \| P)$$

KL divergence is calculated as the sum, over every event x, of the probability of x under P multiplied by the log of the ratio P(x)/Q(x). Equivalently, it is the negative sum of P(x) times the log of Q(x)/P(x).

$$\text{KL}(P \| Q) = \sum_{x \in X} P(x) \cdot \log\left( \frac{P(x)}{Q(x)} \right)$$

If the two distributions are identical, every log term is log(1) = 0 and the divergence is 0. Using base-2 logarithms gives the result in units of bits.

If we are approximating an unknown probability distribution, then P is the target distribution estimated from data and Q is our approximation. In this case, KL(P‖Q) summarizes the average number of additional bits required to encode an event from the random variable when we use Q in place of P. The better our approximation, the less additional information is required.
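The "additional bits" reading can be made explicit by splitting the sum into a cross-entropy term and an entropy term; the derivation below is standard algebra rather than something specific to this notebook.

$$
\begin{aligned}
\text{KL}(P \| Q) &= \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \\
&= -\sum_{x \in X} P(x) \log Q(x) + \sum_{x \in X} P(x) \log P(x) \\
&= H(P, Q) - H(P)
\end{aligned}
$$

Here H(P, Q) is the cross-entropy (the average code length when events drawn from P are encoded with a code optimized for Q) and H(P) is the entropy of P (the best achievable average code length), so the divergence is exactly the extra cost of using Q.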

Example

Consider a random variable representing the color of a marble drawn from a bag. There are 3 colors and two different probability distributions over them.

events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

We can calculate the KL divergence using the formula above.

import numpy as np

def kl_divergence(p, q):
    """Compute KL(P || Q) in bits for two discrete distributions."""
    div = 0
    for i in range(len(p)):
        # Accumulate p(x) * log2(p(x) / q(x)) for each event.
        div += p[i] * np.log2(p[i] / q[i])
    return div

print("KL(P||Q)", kl_divergence(p, q), "bits")
print("KL(Q||P)", kl_divergence(q, p), "bits")
KL(P||Q) 1.9269790471552186 bits
KL(Q||P) 2.0216479703638055 bits
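As a sanity check, the same numbers can be reproduced with SciPy, which computes the same sum directly. This is an optional verification sketch, assuming SciPy is installed.

from scipy.stats import entropy

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

# entropy(pk, qk) computes sum(pk * log(pk / qk)); base=2 gives the result in bits.
print("KL(P||Q)", entropy(p, q, base=2), "bits")
print("KL(Q||P)", entropy(q, p, base=2), "bits")

The printed values should match the hand-rolled implementation above.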

Intuition

If I have two coins, one fair and one weighted, KL divergence tells me how much information I need, on average, to distinguish the weighted coin from the fair coin.
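To make this concrete, here is a small sketch applying the same calculation to a fair coin and a hypothetical weighted coin; the 0.7/0.3 bias is an arbitrary choice for illustration.

import numpy as np

def kl_divergence(p, q):
    # KL(P || Q) in bits for two discrete distributions.
    return sum(pi * np.log2(pi / qi) for pi, qi in zip(p, q))

fair = [0.5, 0.5]      # fair coin: equal probability of heads and tails
weighted = [0.7, 0.3]  # hypothetical weighted coin, biased toward heads

# Extra bits incurred when flips really follow the weighted coin
# but we encode them as if the coin were fair, and vice versa.
print("KL(weighted||fair)", kl_divergence(weighted, fair), "bits")
print("KL(fair||weighted)", kl_divergence(fair, weighted), "bits")

The more extreme the bias, the larger the divergence, and the fewer flips are needed to tell the two coins apart.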

Further reading: Intuitively Understanding the KL Divergence