Understanding how backpropagation and gradients are used to train neural networks.
Using the first or second derivative of a function to sequentially approximate its local extrema is simple, but with multilayer perceptrons, as the network becomes deeper, these derivatives become harder and harder to compute.
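For reference, here is a minimal sketch of that basic idea, using the first derivative to step towards a local minimum; the function `f`, its derivative `df`, the step size, and the iteration count are all illustrative choices, not taken from the text.

```python
# Minimal gradient descent on a one-variable function, as a sketch.
# f, df, learning_rate, and the number of steps are illustrative choices.

def f(x):
    return (x - 3.0) ** 2 + 1.0      # a simple convex function with minimum at x = 3

def df(x):
    return 2.0 * (x - 3.0)           # its first derivative

x = 0.0                              # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * df(x)       # step against the gradient

print(x)                             # converges towards 3.0, the minimum
```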
The most important tool for this is backpropagation via the chain rule, where gradients are calculated for a whole layer of the network at once and then used as the basis for calculating the gradients of the previous layer.
Chain Rule
The simple MLP \( L \circ f( \theta_3) \circ f( \theta_2 ) \circ f( \theta_1 ) ( x ) \) can be represented as a graph:
First the graph is evaluated forwards, calculating the value at each layer based on x, and then the gradients are calculated backwards using the chain rule:
Start at the loss function, updating and storing the gradients of its direct children and marking them as completed.
Then update the nodes whose downstream nodes are all marked as completed.
Repeating this until every node is marked as completed gives the gradient of the loss with respect to each weight.
This allows any graph, no matter how complicated, to be trained using backpropagation, as long as it has no loops and the loss function is reachable from every node.
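A minimal sketch of this traversal for scalar values is shown below; the class `Node`, the operations it supports, and the example expression are assumptions made for illustration. Each node stores its parent nodes and the local derivatives towards them, the forward pass builds the graph as values are computed, and `backward` walks the graph in reverse topological order, which is equivalent to the "all downstream nodes completed" rule above.

```python
# A minimal sketch of reverse-mode autodiff on a scalar graph.
# Names (Node, backward, grad) are illustrative, not from the text.
import math

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value               # result of the forward pass
        self.parents = parents           # upstream nodes this value depends on
        self.local_grads = local_grads   # d(self)/d(parent) for each parent
        self.grad = 0.0                  # dL/d(self), filled in by backward()

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x.value))
    return Node(s, (x,), (s * (1.0 - s),))   # local gradient sigma'(x)

def backward(loss):
    # Reverse topological order: every node is visited only after all of the
    # nodes downstream of it, mirroring the "completed" rule above.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(loss)
    loss.grad = 1.0                       # dL/dL = 1
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local   # chain rule

# Usage: L = (sigmoid(w * x) - y)^2 for scalar w, x, y.
w, x, y = Node(0.5), Node(2.0), Node(1.0)
a = sigmoid(w * x)
diff = a + Node(-1.0) * y
L = diff * diff
backward(L)
print(w.grad)   # dL/dw, computed by the backward traversal
```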
If each hidden neuron has the sigmoid activation
\( \sigma(x) = \frac{1}{1+e^{-x}} \), with \( \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) \),
the output layer \( a_3 = \theta_3 a_2 \) is linear, and the loss is the squared error
\( L(a_3) = \frac{1}{2} \sum_i (y_i - a_{3i})^2 \), with \( \frac{\partial L}{\partial a_3} = a_3 - y \),
then the gradients at each position in the network are:
\( \frac{\partial L}{\partial a_{3i}} = a_{3i} - y_i \)
\( \frac{\partial L}{\partial \theta_{3ij}} = \frac{\partial L}{\partial a_{3i}} \frac{\partial a_{3i}}{\partial \theta_{3ij}} = (a_{3i} - y_i)(a_{2j}) \)
\( \frac{\partial L}{\partial a_{2i}} = \frac{\partial L}{\partial a_{3}} \frac{\partial a_{3}}{\partial a_{2i}} = (a_3 - y)(\theta_{3i}) \)
\( \frac{\partial L}{\partial z_{2i}} = \frac{\partial L}{\partial a_{2i}} \frac{\partial a_{2i}}{\partial z_{2i}} = ((a_3 - y) \cdot \theta_{3i})(\sigma'(z_{2i})) \)
\( \frac{\partial L}{\partial \theta_{2ij}} = \frac{\partial L}{\partial z_{2i}} \frac{\partial z_{2i}}{\partial \theta_{2ij}} = ((a_3 - y) \cdot \theta_{3i} \cdot \sigma'(z_{2i}))(a_{1j}) \)
\( \frac{\partial L}{\partial a_{1i}} = \frac{\partial L}{\partial z_{2}} \frac{\partial z_{2}}{\partial a_{1i}} = ((a_3 - y) \cdot \theta_{3} \cdot \sigma'(z_{2}))(\theta_{2i}) \)
\( \frac{\partial L}{\partial z_{1i}} = \frac{\partial L}{\partial a_{1i}} \frac{\partial a_{1i}}{\partial z_{1i}} = ((a_3 - y) \cdot \theta_{3} \cdot \sigma'(z_{2}) \cdot \theta_{2i})(\sigma'(z_{1i})) \)
\( \frac{\partial L}{\partial \theta_{1ij}} = \frac{\partial L}{\partial z_{1i}} \frac{\partial z_{1i}}{\partial \theta_{1ij}} = ((a_3 - y) \cdot \theta_{3} \cdot \sigma'(z_{2}) \cdot \theta_{2i} \cdot \sigma'(z_{1i}))(x_{j}) \)
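To make these equations concrete, here is a NumPy sketch of the same three-layer network, assuming two sigmoid hidden layers, a linear output layer, and the squared-error loss above; the layer sizes and the random input, target, and weights are arbitrary. Each line of the backward pass mirrors one line above.

```python
# Forward and backward pass for the MLP described above, as a sketch.
# Layer sizes, random weights, and the input/target values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=4)          # input
y  = rng.normal(size=2)          # target
t1 = rng.normal(size=(3, 4))     # theta_1
t2 = rng.normal(size=(3, 3))     # theta_2
t3 = rng.normal(size=(2, 3))     # theta_3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass, storing every intermediate value for the backward pass.
z1 = t1 @ x;  a1 = sigmoid(z1)
z2 = t2 @ a1; a2 = sigmoid(z2)
a3 = t3 @ a2                      # linear output layer
L  = 0.5 * np.sum((y - a3) ** 2)

# Backward pass: each line mirrors one of the equations above.
dL_da3 = a3 - y                               # dL/da_3
dL_dt3 = np.outer(dL_da3, a2)                 # dL/dtheta_3 = (a_3 - y) a_2^T
dL_da2 = t3.T @ dL_da3                        # dL/da_2
dL_dz2 = dL_da2 * a2 * (1.0 - a2)             # dL/dz_2 = dL/da_2 * sigma'(z_2)
dL_dt2 = np.outer(dL_dz2, a1)                 # dL/dtheta_2
dL_da1 = t2.T @ dL_dz2                        # dL/da_1
dL_dz1 = dL_da1 * a1 * (1.0 - a1)             # dL/dz_1
dL_dt1 = np.outer(dL_dz1, x)                  # dL/dtheta_1

print(dL_dt1.shape, dL_dt2.shape, dL_dt3.shape)
```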
When attempting to learn a finite set of labels, a Classifier with one-hot encoding can be used.
One Hot Encoding
Each label is encoded as a vector of zeros with a single 1, where the index of the 1 is the label of the image.
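A small sketch of this encoding; the function name and the number of classes are just illustrative:

```python
# One-hot encoding: label 3 out of 5 classes becomes [0, 0, 0, 1, 0].
def one_hot(label, num_classes):
    vec = [0] * num_classes
    vec[label] = 1
    return vec

print(one_hot(3, 5))   # [0, 0, 0, 1, 0]
```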
This uses
Cross Entropy Loss
\( L(p) = -\sum_i y_i \log(p_i) \), where \( y \) is the one-hot label vector and \( p \) is the vector of predicted probabilities.
This rewards the model for assigning a high probability to the correct class, but does not directly penalize it for the probabilities it assigns to the wrong ones, since only the term for the correct class is non-zero.
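A small sketch of the loss under the formula above, with made-up probabilities, showing that only the correct class's term contributes:

```python
# Cross entropy with a one-hot label: only the correct class's term survives.
import math

y = [0, 0, 1, 0]                 # one-hot label, correct class is index 2
p = [0.1, 0.2, 0.6, 0.1]         # predicted probabilities (made up)

loss = -sum(yi * math.log(pi) for yi, pi in zip(y, p))
print(loss)                      # == -log(0.6); the other classes do not appear
```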
The final layer of the classifier is a
Softmax Activation
\( \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} \), which elevates the highest value and suppresses the others, ensuring the probabilities sum to 1. This ensures the model does not simply output high estimates for every class.
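A minimal sketch of the softmax with arbitrary input values:

```python
# Softmax: exponentiate and normalize so the outputs sum to 1.
# (A numerically stable version would subtract max(z) before exponentiating.)
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))        # largest input gets most of the probability
print(sum(softmax([2.0, 1.0, 0.1])))   # 1.0
```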
When these two are combined, they yield a simpler equation for the final two layers of the classifier:
\( \frac{\partial L}{\partial p_i} = \sigma(p_i) - y_i \)
where \( p \) is the vector of pre-softmax values (logits) and \( \sigma(p_i) \) denotes the \( i \)-th output of the softmax.
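This combined gradient can be checked numerically. The sketch below compares \( \sigma(p) - y \) against a finite-difference estimate of the cross-entropy-of-softmax loss; the logits, target, and step size are arbitrary choices.

```python
# Check that d(cross_entropy(softmax(p)))/dp_i == softmax(p)_i - y_i.
import math

def softmax(z):
    m = max(z)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(p, y):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(p)))

p = [1.5, -0.3, 0.8]
y = [0, 1, 0]                                   # one-hot target

analytic = [si - yi for si, yi in zip(softmax(p), y)]

eps = 1e-6
numeric = []
for i in range(len(p)):
    bumped = list(p)
    bumped[i] += eps
    numeric.append((loss(bumped, y) - loss(p, y)) / eps)

print(analytic)
print(numeric)   # matches the analytic gradient to within ~1e-5
```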
Here is what this network looks like:
Forward Pass
\( \frac{\partial L}{\partial p_i} = \sigma(p_i) - y_i \)
\( \frac{\partial L}{\partial c_i} = \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial c_i} = (\sigma(p_i) - y_i) \left( \frac{1}{2} \right) \)
\( \frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial c} \frac{\partial c}{\partial \theta_i} = \left( \frac{\sigma(p) - y}{2} \right) ( x_{i-1} + x_i + x_{i+1} ) \)
\( \frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial c} \frac{\partial c}{\partial x_i} = \left( \frac{\sigma(p) - y}{2} \right) ( \theta ) \)