
John Doe


Xavier/Glorot Weight Initialization

Explore how to initialize Neural Network weights to avoid gradient explosion/vanishing.

Problem

As neural networks grow deeper, improperly initialized weights can lead to:
  • Vanishing gradients
  • Exploding gradients
  • Slow convergence
  • Getting stuck in poor local minima
In their paper Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot and Yoshua Bengio show that initializing the weights so that the distributions of the activations and gradients remain constant through the layers helps to alleviate these issues.

Here we show that the Xavier/Glorot weight initialization for a network with tanh activation is

\[\Large W \sim N \bigg( 0, \frac{1}{n_{l}} \bigg) \]

where \(n_l\) is the number of inputs to layer \(l\).
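A minimal NumPy sketch of sampling such a weight matrix (the function name and layer sizes are illustrative, not taken from the demo below):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_tanh_init(n_in, n_out):
    # Var(W) = 1 / n_in, so the standard deviation is 1 / sqrt(n_in).
    std = np.sqrt(1.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = xavier_tanh_init(2048, 2048)
print(W.var())  # close to 1/2048 ~= 4.9e-4
```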

Solution

For the Xavier/Glorot initialization, the goal is to keep the variance of the activations and gradients the same from layer to layer:

\[\Large \text{Var}(y) = \text{Var}(x) \]
and:
\[\Large \text{Var}(dy) = \text{Var}(dx) \]

A single dense layer with the tanh activation function can be written as:

\[\Large y = \tanh(Wx) \]

We assume that the input \(x\) is zero-centered and has unit variance:

\[\Large \text{Var}(x) = 1, E[x] = 0 \]
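In practice this assumption is usually met by standardizing the input data; a minimal NumPy sketch (the raw mean and variance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.5, size=10_000)   # raw inputs with arbitrary mean/variance

x = (x - x.mean()) / x.std()            # zero-center and scale to unit variance
print(x.mean(), x.var())                # approximately 0.0 and 1.0
```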

We consider the case where the network has a linear response function and the biases are incorporated into the input vector.

Considering each neuron and observation independently, with the weight vector \( w \) and observation \( x \), the output variance is:

\[\Large \text{Var}(y) = \text{Var}(wx) = \text{Var}( \Sigma_{i} w_i x_i ) \]
Since the weights and inputs are independently distributed, the variance of the sum is the sum of the variances:
\[\Large \text{Var}( \Sigma w_i x_i ) = \Sigma \text{Var}(w_i x_i) \]
\[\Large \Sigma \text{Var}(w_i x_i) = \Sigma \big( E[w_i^2 x_i^2] - E[w_i x_i]^2 \big) \]

\( w_i \) and \( x_i \) are independent and zero-centered, so their expectations are separable and zero:

\[\Large E[w_i x_i]^2 = E[w_i]^2 E[x_i]^2 = 0 \]

\( w_i^2 \) and \( x_i^2 \) are also independent, and since \( w_i \) and \( x_i \) are zero-centered their second moments equal their variances:

\[\Large E[w_i^2 x_i^2] = E[w_i^2] E[x_i^2] = \text{Var}(w_i) \text{Var}(x_i) \]

Putting it back together:

\[\Large \text{Var}(y) = \Sigma \text{Var}(w_i) \text{Var}(x_i) = n_l \text{Var}(w) \text{Var}(x) \]
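This product rule is easy to check numerically; a small NumPy sketch with an arbitrarily chosen width and weight variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                        # inputs to the neuron (arbitrary)
var_w = 0.05                                   # an arbitrary weight variance

x = rng.normal(0.0, 1.0, size=(20_000, n))     # Var(x) = 1
w = rng.normal(0.0, np.sqrt(var_w), size=n)    # Var(w) = var_w

y = x @ w                                      # one linear neuron per observation
print(y.var())                                 # empirical Var(y)
print(n * var_w * 1.0)                         # predicted n_l * Var(w) * Var(x) = 25.6
```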

Setting \(\text{Var}(x)\) and \(\text{Var}(y)\) equal to each other:

\[\Large \text{Var}(y) = \text{Var}(x) \]
\[\Large \text{Var}(W) = \frac{\text{Var}(y)}{n_l\text{Var}(x)} = \frac{1}{n_l} \]
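With \( \text{Var}(W) = 1/n_l \) the activation variance should stay roughly constant through a stack of linear layers; a quick NumPy check (width and depth chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                        # layer width (arbitrary)
h = rng.normal(0.0, 1.0, size=(10_000, n))     # unit-variance input

for layer in range(6):
    W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))   # Var(W) = 1/n_l
    h = h @ W                                             # linear response
    print(layer, round(float(h.var()), 3))                # stays close to 1.0
```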

The same derivation applied to the gradients in the backward pass gives the constraint:

\[\Large \text{Var}(W) = \frac{1}{n_{out}} \]

Both constraints cannot in general be satisfied at once, so the compromise is to use the average of the input and output sizes, \( (n_{in} + n_{out})/2 \):

\[\Large \text{Var}(W) = \frac{2}{n_{in} + n_{out}} \]
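A minimal NumPy sketch of this averaged (Glorot) initialization, in both its normal and uniform forms (the helper names are illustrative; the uniform bound follows from \( \text{Var}(U[-a, a]) = a^2/3 \)):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(n_in, n_out):
    # Var(W) = 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def glorot_uniform(n_in, n_out):
    # U[-a, a] has variance a^2 / 3, so a = sqrt(6 / (n_in + n_out))
    # gives the same Var(W) = 2 / (n_in + n_out).
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

print(glorot_normal(2048, 1024).var())    # ~ 2/3072 ~= 6.5e-4
print(glorot_uniform(2048, 1024).var())   # ~ the same
```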

The \( \tanh \) activation does not preserve variance exactly, but it is approximately linear near the origin (\( \tanh(z) \approx z \) for small \( z \)), so as long as the activations stay small its effect on the variance is negligible.
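A quick numerical check of this near-linearity:

```python
import numpy as np

z = np.array([0.01, 0.1, 0.5, 1.0, 2.0])
print(np.tanh(z))        # [0.0100, 0.0997, 0.4621, 0.7616, 0.9640]
print(np.tanh(z) / z)    # ratio is ~1 near the origin, shrinks as |z| grows
```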

Interactive Demonstration

Experiment with different weight initialization techniques and observe their effects on neural network training:

[Interactive demo: generates random N(0, 1) input data and lets you choose the activation function, the weight initialization method, and the number of layers (default: 6 layers of 2048 units each).]

Techniques Explored

  • Xavier/Glorot Initialization - Designed for layers with linear or near-linear (e.g. tanh) activations
  • He Initialization - Optimized for ReLU activation functions
  • Random Normal/Uniform Initialization - Traditional approaches with various drawbacks (compared numerically in the sketch below)
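The sketch below is a rough offline version of that comparison: it pushes random unit-variance data through a stack of tanh layers under each scheme and prints the standard deviation of the activations at every layer (the width, depth, and helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_stds(depth, width, weight_std, x):
    """Push x through `depth` tanh layers with N(0, weight_std^2) weights
    and record the standard deviation of the activations at each layer."""
    stds = []
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(width, width))
        h = np.tanh(h @ W)
        stds.append(float(h.std()))
    return stds

width, depth = 512, 6
x = rng.normal(0.0, 1.0, size=(2_000, width))

inits = {
    "xavier": np.sqrt(2.0 / (width + width)),  # Glorot: 2 / (n_in + n_out)
    "he":     np.sqrt(2.0 / width),            # He: 2 / n_in (intended for ReLU)
    "normal": 1.0,                             # naive N(0, 1): saturates tanh
}
for name, weight_std in inits.items():
    print(name, [round(s, 3) for s in layer_stds(depth, width, weight_std, x)])
```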

Takeaways

  • Initializing with low variance causes the activations to contract towards the center, mapping a large domain onto a small one.
    Large changes in the early layers then have only a small effect on the output, i.e. vanishing gradients.
  • The opposite happens with high variance: the activations expand from layer to layer, leading to exploding gradients.
  • Xavier/Glorot initialization keeps the variance roughly stable across layers, but it is still susceptible to dramatic changes in layer sizes.