Batch normalization was introduced and widely popularized by the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In a deep neural network, the activations between layers depend heavily on the parameter initialization, which in turn affects how gradients propagate back through each layer during training. Poor initialization can greatly affect both how well a network trains and how fast it trains. Networks train best when each layer has a unit Gaussian distribution for its activations. So if you really want unit Gaussian activations, you can make them so by applying batch normalization to every layer.
Basically, batch normalization is a powerful technique for decoupling the weight updates from the parameter initialization. As the paper puts it, batch normalization allows us to use much higher learning rates and be less careful about initialization. Consider a batch of activations at some layer; we can make each dimension (denoted by k) unit Gaussian by applying:
\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}
Each batch of training examples has dimension D. We compute the empirical mean and variance independently for each dimension using the examples in the mini-batch. Batch normalization is usually inserted after fully connected or convolutional layers and before the nonlinearity is applied. For a convolutional layer, we basically have one mean and one standard deviation per activation map, and we normalize across all of the examples in the batch of data.
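To make this concrete, here is a minimal NumPy sketch (the array shapes and names are illustrative, not from the paper) of normalizing per dimension for a fully connected layer and per activation map for a convolutional layer:

```python
import numpy as np

# Fully connected case: batch of shape (N, D), one mean and one
# variance per dimension, computed over the N examples.
x = np.random.randn(32, 4) * 5.0 + 3.0
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0))

# Convolutional case: batch of shape (N, C, H, W), one mean and one
# variance per activation map (channel), computed over N, H, and W.
conv = np.random.randn(32, 16, 8, 8)
mu = conv.mean(axis=(0, 2, 3), keepdims=True)
var = conv.var(axis=(0, 2, 3), keepdims=True)
conv_hat = (conv - mu) / np.sqrt(var)
```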
Avoid Constraints by Learning
If we have a tanh layer, we don't really want to constrain it to the linear regime. The act of normalization might force activations to stay within the center of tanh, which is its linear regime. We want flexibility, so ideally batch normalization should be learned as a parameter of the network. In other words, we should insert parameters which can be learned to effectively cancel out batch normalization if the network sees fit.
We will apply the following operation to each normalized vector:
y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}
Such that the network can learn
\gamma^{(k)} = \sqrt{Var[x^{(k)}]}
\beta^{(k)} = E[x^{(k)}]
And effectively recover the identity mapping as if you didn't have batch normalization, i.e. to cancel out the batch normalization if the network sees fit.
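As a quick numerical check of this claim, here is a sketch in plain NumPy (the batch and its statistics are made up for illustration):

```python
import numpy as np

# If gamma = sqrt(Var[x]) and beta = E[x], the scale-and-shift
# undoes the normalization and recovers the original activations.
x = np.random.randn(64, 10) * 2.0 + 7.0
mean, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var)

gamma = np.sqrt(var)   # learned gamma equal to the batch std
beta = mean            # learned beta equal to the batch mean
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True: identity mapping recovered
```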
Procedure
Inputs: Values of x over a mini-batch: B = \{x_1, \dots, x_m\}
Outputs: \{y_i = BN_{\gamma, \beta}(x_i)\}
Find mini-batch mean:
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
Find mini-batch variance:
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
Normalize:
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
Scale and shift:
y_i = \gamma \hat{x}_i + \beta = BN_{\gamma, \beta}(x_i)
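The four steps map one-to-one onto code; below is a minimal NumPy sketch (the mini-batch, eps value, and parameter shapes are illustrative):

```python
import numpy as np

x = np.random.randn(8, 3)        # mini-batch B = {x_1 ... x_m}, m = 8
gamma, beta = np.ones(3), np.zeros(3)
eps = 1e-5

mu_B = x.mean(axis=0)                         # mini-batch mean
sigma2_B = x.var(axis=0)                      # mini-batch variance
x_hat = (x - mu_B) / np.sqrt(sigma2_B + eps)  # normalize
y = gamma * x_hat + beta                      # scale and shift
```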
Benefits
Improves gradient flow through the network
Allows higher learning rates
Reduces the strong dependence on initialization
Acts as a form of regularization in a somewhat surprising way, since each example's output depends on the statistics of the other examples in its mini-batch, and slightly reduces the need for dropout
Detailed Implementation & Derivation
Here comes the derivation; much of it comes from the paper itself and from Kevin Zakka's blog on GitHub.
Notations
BN stands for batch normalization
x is the input matrix/vector to the BN layer
μ is the batch mean
σ2 is the batch variance
ϵ is a small constant added to avoid dividing by zero
x^ is the normalized input matrix/vector
y is the output of the linear transformation, which scales x^ by γ and shifts it by β
f represents the next layer after BN layer, if we assume a forward pass ordering
Forward Pass
The forward pass is straightforward, both intuitively and mathematically.
First we find the mean across a mini-batch of training examples
\mu = \frac{1}{m} \sum_{i=1}^{m} x_i
Find the variance across the same mini-batch of training examples
\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2
And then apply normalization
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
Finally, apply the linear transformation with learned parameters so the network can recover the identity mapping. In case we wonder why we need to do this, the paper explains:
Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform.
y_i = \gamma \hat{x}_i + \beta = BN_{\gamma, \beta}(x_i)
If γ is 1 and β is 0, then the linear transformation is an identity transformation.
```python
import numpy as np

def batch_norm_forward(x, gamma, beta, bn_params):
    eps = bn_params.get('eps', 1e-5)
    momentum = bn_params.get('momentum', 0.9)
    mode = bn_params.get('mode', 'train')
    N, D = x.shape
    running_mean = bn_params.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_params.get('running_var', np.zeros(D, dtype=x.dtype))
    y = None
    if mode == 'train':
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_norm = (x - mean) / np.sqrt(var + eps)
        y = x_norm * gamma + beta
        # Update running mean and running variance during training time
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    elif mode == 'test':
        # Use running mean and running variance for making test predictions
        x_norm = (x - running_mean) / np.sqrt(running_var + eps)
        y = x_norm * gamma + beta
    else:
        raise ValueError('Invalid forward pass batch norm mode %s' % mode)
    bn_params['running_mean'] = running_mean
    bn_params['running_var'] = running_var
    return y

x = np.random.rand(4, 4)
bn_params = {}
y = batch_norm_forward(x, 1, 0, bn_params)
```
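As a quick sanity check of the sketch above, the training-mode output should have roughly zero mean and unit standard deviation per dimension when γ is 1 and β is 0, and switching the mode reuses the accumulated running statistics at test time:

```python
# Each column of y should now be approximately unit Gaussian.
print(y.mean(axis=0))  # close to 0
print(y.std(axis=0))   # close to 1

# At test time, reuse the running statistics accumulated during training.
bn_params['mode'] = 'test'
y_test = batch_norm_forward(np.random.rand(4, 4), 1, 0, bn_params)
```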