Explore how to initialize neural network weights to avoid exploding and vanishing gradients.
Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot and Yoshua Bengio show that initializing the weights so that the distributions of the activations and gradients remain constant across layers helps to alleviate these issues.
Here we show how to arrive at the Xavier/Glorot weight initialization for a network with \(\tanh\) activations.
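In its commonly quoted form, the weights of layer \(l\) are drawn with variance

\[ \text{Var}(W_l) = \frac{2}{n_l + n_{l+1}}, \qquad \text{for example} \quad W_l \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}},\ \frac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}\right], \]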
where \(n_l\) is the number of inputs in layer \(l\) and \(n_{l+1}\) the number of inputs in layer \(l+1\) (i.e. the outputs of layer \(l\)).
For the Xavier/Glorot initialization, the goal is to keep the variance of the activations and of the gradients the same from layer to layer.
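With \(a^{(l)}\) denoting the activations of layer \(l\) and \(C\) the cost (notation introduced here for convenience), the conditions read

\[ \text{Var}\big(a^{(l)}\big) = \text{Var}\big(a^{(l')}\big) \qquad \text{and} \qquad \text{Var}\!\left(\frac{\partial C}{\partial a^{(l)}}\right) = \text{Var}\!\left(\frac{\partial C}{\partial a^{(l')}}\right) \qquad \text{for all layers } l, l'. \]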
A single dense layer with the tanh activation function can be written as:
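\[ y = \tanh(W x + b), \]

where \(W\), \(b\) and \(y\) are the layer's weight matrix, bias vector and output (symbols chosen here for the derivation).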
We assume that the input \(x\) is zero-centered and has unit variance:
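\[ \mathbb{E}[x_i] = 0 \qquad \text{and} \qquad \text{Var}(x_i) = 1 \qquad \text{for every component } x_i. \]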
We consider the case where the network has a linear response function, and the biases are incorporated into the input vector.
Considering each neuron and observation independently, with the weight vector \( w \) and observation \( x \), the output variance is:
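\[ \text{Var}(y) = \text{Var}\!\left(\sum_{i=1}^{n_l} w_i x_i\right) = \mathbb{E}\!\left[\left(\sum_{i=1}^{n_l} w_i x_i\right)^{\!2}\,\right] - \mathbb{E}\!\left[\sum_{i=1}^{n_l} w_i x_i\right]^{2}, \]

writing the linear response as \(y = w^\top x = \sum_{i=1}^{n_l} w_i x_i\) over the \(n_l\) inputs.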
\( w_i \) and \( x_i \) are independent and zero-centered, so their expectations are separable and zero:
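\[ \mathbb{E}\!\left[\sum_{i=1}^{n_l} w_i x_i\right] = \sum_{i=1}^{n_l} \mathbb{E}[w_i]\,\mathbb{E}[x_i] = 0. \]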
\( x_i^2 \) and \( w_i^2 \) are also independent:
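\[ \mathbb{E}\!\left[\left(\sum_{i=1}^{n_l} w_i x_i\right)^{\!2}\,\right] = \sum_{i=1}^{n_l} \mathbb{E}[w_i^2]\,\mathbb{E}[x_i^2] = \sum_{i=1}^{n_l} \text{Var}(w_i)\,\text{Var}(x_i). \]

The cross terms \(\mathbb{E}[w_i w_j x_i x_j]\) with \(i \neq j\) drop out because the weights are zero-centered and independent of each other and of the inputs, and \(\mathbb{E}[w_i^2] = \text{Var}(w_i)\), \(\mathbb{E}[x_i^2] = \text{Var}(x_i)\) for zero-centered variables.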
Putting it back together:
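\[ \text{Var}(y) = \sum_{i=1}^{n_l} \text{Var}(w_i)\,\text{Var}(x_i) = n_l\,\text{Var}(w)\,\text{Var}(x), \]

assuming every weight of the layer shares the same variance \(\text{Var}(w)\) and every input component shares \(\text{Var}(x)\).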
Setting \(\text{Var}(x)\) and \(\text{Var}(y)\) equal to each other:
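\[ \text{Var}(x) = n_l\,\text{Var}(w)\,\text{Var}(x) \quad \Longrightarrow \quad \text{Var}(w) = \frac{1}{n_l}. \]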
The same can be done for the gradient, resulting in two layer-wise conditions.
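Backpropagation sums over the \(n_{l+1}\) outputs of the layer rather than its \(n_l\) inputs, so the forward and backward requirements are

\[ \text{Var}(w) = \frac{1}{n_l} \qquad \text{and} \qquad \text{Var}(w) = \frac{1}{n_{l+1}}. \]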
Both conditions cannot, in general, be satisfied at the same time, so the compromise is to use the average of the number of inputs and outputs:
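\[ \text{Var}(w) = \frac{2}{n_l + n_{l+1}}. \]

Drawing the weights from a uniform distribution \(U[-a, a]\), which has variance \(a^2/3\), this corresponds to the bound \(a = \sqrt{6}/\sqrt{n_l + n_{l+1}}\) quoted at the top.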
The \(\tanh\) activation function does not satisfy the linearity assumption exactly, but it behaves approximately linearly near the origin, so its effect on the variances is negligible at initialization.
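Concretely, the Taylor expansion around the origin is

\[ \tanh(z) = z - \frac{z^3}{3} + O(z^5) \approx z \quad \text{for small } |z|, \]

so as long as the pre-activations stay small, the linear analysis above carries over.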
Experiment with different weight initialization techniques and observe their effects on neural network training: draw random input data from \(N(0, 1)\), pick an activation function and a weight initialization method, and propagate the data through 6 layers.
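A minimal sketch of such an experiment in NumPy is shown below. The layer width, sample count, and the two schemes being compared (naive \(N(0, 1)\) weights versus the Xavier/Glorot uniform initialization) are illustrative choices, not a prescribed setup. With the naive scheme the \(\tanh\) units saturate almost immediately, while the Xavier/Glorot scheme keeps the activations in their responsive range.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_units, n_layers = 1_000, 256, 6


def propagate(init):
    """Return (activation std, fraction of saturated units) after each layer."""
    a = rng.standard_normal((n_samples, n_units))  # random input data ~ N(0, 1)
    stats = []
    for _ in range(n_layers):
        if init == "normal":
            # naive initialization: every weight ~ N(0, 1)
            w = rng.standard_normal((n_units, n_units))
        else:
            # Xavier/Glorot uniform: Var(w) = 2 / (n_in + n_out)
            limit = np.sqrt(6.0 / (n_units + n_units))
            w = rng.uniform(-limit, limit, size=(n_units, n_units))
        a = np.tanh(a @ w)  # a single dense layer with tanh activation
        stats.append((a.std(), np.mean(np.abs(a) > 0.99)))
    return stats


for init in ("normal", "xavier"):
    per_layer = ", ".join(f"{std:.2f}/{sat:.0%}" for std, sat in propagate(init))
    print(f"{init:>7}: activation std / saturated fraction per layer -> {per_layer}")
```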