Weight Initialization
Zeros Are Bad
In general, we should never initialize the weights of our network to all zeros.
If we do, then when training starts every neuron computes exactly the same output, so all the weights receive identical updates and the neurons never learn to represent different features, which greatly reduces the power of the network.
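As a rough sketch of this symmetry problem (the tiny network, data, and mean-squared-error loss below are illustrative assumptions), we can train a two-layer NumPy network from all-zero weights and observe that the hidden units never become different from one another:

import numpy as np

# Tiny two-layer network with all-zero weights (illustrative setup)
np.random.seed(0)
X = np.random.randn(8, 3)        # 8 samples, 3 features
y = np.random.randn(8, 1)        # regression targets

W1 = np.zeros((3, 4))            # all-zero first-layer weights
W2 = np.zeros((4, 1))            # all-zero second-layer weights
lr = 0.1

for _ in range(50):
    H = 1.0 / (1.0 + np.exp(-(X @ W1)))   # sigmoid hidden layer
    out = H @ W2
    d_out = (out - y) / len(X)            # gradient of the MSE loss (up to a constant factor)
    dW2 = H.T @ d_out
    dH = d_out @ W2.T
    dW1 = X.T @ (dH * H * (1 - H))
    W1 -= lr * dW1
    W2 -= lr * dW2

# Every column of W1 (one column per hidden unit) is still identical,
# so the four hidden units behave as a single unit.
print(np.allclose(W1, W1[:, [0]]))   # True
print(W2.ravel())                    # all four entries are equal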
Normally Distributed
One naive approach is to use small random numbers that are normally distributed, e.g. Gaussian random numbers with zero mean and a standard deviation of $10^{-2}$:
W = 0.01 * np.random.randn(fan_in, fan_out)

Glossary: fan_in is a term for the number of inputs a unit (or layer) accepts; fan_out is a term for the number of outputs it feeds to the units of the next layer. In the code below, they are simply the input and output dimensions of each layer's weight matrix.
import numpy as np

def forward_prop(hidden_layer_sizes, weight_init_func):
    """A simple experiment showing how weight initialization impacts activations through deep layers."""
    # Extract the first hidden layer dimension
    h1_dim = hidden_layer_sizes[0]
    # Randomly initialize 1000 inputs
    inputs = np.random.randn(1000, h1_dim)
    nonlinearities = ['tanh'] * len(hidden_layer_sizes)
    act_func = {
        'relu': lambda x: np.maximum(0, x),
        'tanh': lambda x: np.tanh(x)
    }
    hidden_layer_acts = dict()
    for i in range(len(hidden_layer_sizes)):
        if i == 0:
            X = inputs
        else:
            X = hidden_layer_acts[i - 1]
        fan_in = X.shape[1]
        fan_out = hidden_layer_sizes[i]
        W = weight_init_func(fan_in, fan_out)
        H = np.dot(X, W)
        H = act_func[nonlinearities[i]](H)
        hidden_layer_acts[i] = H
    hidden_layer_means = [np.mean(H) for i, H in hidden_layer_acts.items()]
    hidden_layer_stds = [np.std(H) for i, H in hidden_layer_acts.items()]
    return hidden_layer_acts, hidden_layer_means, hidden_layer_stds

def small_random_init(fan_in, fan_out):
    return 0.01 * np.random.randn(fan_in, fan_out)

hidden_layer_sizes = [500] * 10
hidden_layer_acts, hidden_layer_means, hidden_layer_stds = forward_prop(hidden_layer_sizes, small_random_init)
for i, H in hidden_layer_acts.items():
    print('Hidden layer %d had mean %f and std %f' % (i + 1, hidden_layer_means[i], hidden_layer_stds[i]))

Notice that when we have a 10-layer deep network, all activations approach zero by the last layers with this initialization. Well, how about increasing the standard deviation?
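As a quick follow-up, we can reuse forward_prop with a much larger scale. The exact value below, a standard deviation of 1.0, is an illustrative choice; any sufficiently large scale shows the same effect.

def large_random_init(fan_in, fan_out):
    # Same Gaussian initialization, but with a standard deviation of 1.0 instead of 0.01
    return 1.0 * np.random.randn(fan_in, fan_out)

hidden_layer_acts, hidden_layer_means, hidden_layer_stds = forward_prop(hidden_layer_sizes, large_random_init)
for i in hidden_layer_acts:
    print('Hidden layer %d had mean %f and std %f' % (i + 1, hidden_layer_means[i], hidden_layer_stds[i]))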


Notice that almost all neurons are now completely saturated at either -1 or 1 in every layer. This means their local gradients are essentially zero, and we won't be able to perform any learning on this network.
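To see why saturation stops learning, recall that the local gradient of tanh is $1 - \tanh^2(x)$, which vanishes as the activation approaches $\pm 1$. A quick check (the sample points below are arbitrary):

# The local gradient of tanh vanishes once inputs push the unit toward -1 or 1
x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(1 - np.tanh(x) ** 2)   # approx. [8.2e-09, 7.1e-02, 1.0, 7.1e-02, 8.2e-09]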
Xavier Initialization
So what is this saying? Naively scaled Gaussian random numbers do not work for deep-network weight initialization: too small a scale and the activations vanish, too large and they saturate. A good rule of thumb is to try Xavier initialization, introduced in "Understanding the difficulty of training deep feedforward neural networks" (Glorot et al., 2010).
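The idea is to tie the scale of the initialization to the layer's fan-in. A commonly used form, and the one assumed in the sketch below, divides standard Gaussian weights by $\sqrt{\text{fan\_in}}$ so that the variance of the activations stays roughly constant from layer to layer; it plugs straight into forward_prop from above.

def xavier_init(fan_in, fan_out):
    # Scale the Gaussian by 1/sqrt(fan_in) to roughly preserve activation variance
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

hidden_layer_acts, hidden_layer_means, hidden_layer_stds = forward_prop(hidden_layer_sizes, xavier_init)
for i in hidden_layer_acts:
    print('Hidden layer %d had mean %f and std %f' % (i + 1, hidden_layer_means[i], hidden_layer_stds[i]))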


Now the activations are nicely spread out and roughly normally distributed in every layer, instead of collapsing to zero or saturating!