Understanding how backpropagation and gradients are used to train neural networks.
Using the first or second derivative of a function to sequentially approximate its local extrema is simple, but with multilayer perceptrons, as the network becomes deeper, these derivatives become harder and harder to compute.
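For reference, here is a minimal sketch of that basic idea, using the first derivative to step towards a local minimum; the function `f`, its derivative `df`, the step size, and the iteration count are all illustrative choices, not taken from the text.

```python
# Minimal gradient descent on a one-variable function, as a sketch.
# f, df, learning_rate, and the number of steps are illustrative choices.

def f(x):
    return (x - 3.0) ** 2 + 1.0      # a simple convex function with minimum at x = 3

def df(x):
    return 2.0 * (x - 3.0)           # its first derivative

x = 0.0                              # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * df(x)       # step against the gradient

print(x)                             # converges towards 3.0, the minimum
```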
The most important tool for this is backpropagation via the chain rule, where gradients are calculated for a whole layer of the network at once and then used as the basis for calculating the gradients of the previous layer.
Chain Rule
The simple MLP \( L \circ f( \theta_3) \circ f( \theta_2 ) \circ f( \theta_1 ) ( x ) \) can be represented as a graph:
First the graph is evaluated forwards, calculating the value at each layer based on x, and then the gradients are calculated backwards using the chain rule:
Start at the loss function, updating and storing the gradients of its direct children and marking them as completed.
Then update the nodes whose downstream nodes are all marked as completed.
Repeating this until every node is marked as completed gives the gradient of the loss with respect to each weight.
This allows any graph, no matter how complicated, to be trained using backpropagation, as long as it has no loops and the loss function is reachable from every node.
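A minimal sketch of this traversal for scalar values is shown below; the class `Node`, the operations it supports, and the example expression are assumptions made for illustration. Each node stores its parent nodes and the local derivatives towards them, the forward pass builds the graph as values are computed, and `backward` walks the graph in reverse topological order, which is equivalent to the "all downstream nodes completed" rule above.

```python
# A minimal sketch of reverse-mode autodiff on a scalar graph.
# Names (Node, backward, grad) are illustrative, not from the text.
import math

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value               # result of the forward pass
        self.parents = parents           # upstream nodes this value depends on
        self.local_grads = local_grads   # d(self)/d(parent) for each parent
        self.grad = 0.0                  # dL/d(self), filled in by backward()

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x.value))
    return Node(s, (x,), (s * (1.0 - s),))   # local gradient sigma'(x)

def backward(loss):
    # Reverse topological order: every node is visited only after all of the
    # nodes downstream of it, mirroring the "completed" rule above.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(loss)
    loss.grad = 1.0                       # dL/dL = 1
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local   # chain rule

# Usage: L = (sigmoid(w * x) - y)^2 for scalar w, x, y.
w, x, y = Node(0.5), Node(2.0), Node(1.0)
a = sigmoid(w * x)
diff = a + Node(-1.0) * y
L = diff * diff
backward(L)
print(w.grad)   # dL/dw, computed by the backward traversal
```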
If each hidden neuron has the sigmoid activation
\( \sigma(x) = \frac{1}{1+e^{-x}} \), with \( \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) \),
the output layer \( a_3 = \theta_3 a_2 \) is linear, and the loss is the squared error
\( L(a_3) = \frac{1}{2} \sum_i (y_i - a_{3i})^2 \), with \( \frac{\partial L}{\partial a_3} = a_3 - y \),
then the gradients at each position in the network are:
\( \frac{\partial L}{\partial a_{3i}} = a_{3i} - y_i \)
\( \frac{\partial L}{\partial \theta_{3ij}} = \frac{\partial L}{\partial a_{3i}} \frac{\partial a_{3i}}{\partial \theta_{3ij}} = (a_{3i} - y_i)(a_{2j}) \)
\( \frac{\partial L}{\partial a_{2i}} = \frac{\partial L}{\partial a_{3}} \frac{\partial a_{3}}{\partial a_{2i}} = (a_3 - y)(\theta_{3i}) \)
\( \frac{\partial L}{\partial z_{2i}} = \frac{\partial L}{\partial a_{2i}} \frac{\partial a_{2i}}{\partial z_{2i}} = ((a_3 - y) \cdot \theta_{3i})(\sigma'(z_{2i})) \)
\( \frac{\partial L}{\partial \theta_{2ij}} = \frac{\partial L}{\partial z_{2i}} \frac{\partial z_{2i}}{\partial \theta_{2ij}} = ((a_3 - y) \cdot \theta_{3i} \cdot \sigma'(z_{2i}))(a_{1j}) \)
\( \frac{\partial L}{\partial a_{1i}} = \frac{\partial L}{\partial z_{2}} \frac{\partial z_{2}}{\partial a_{1i}} = ((a_3 - y) \cdot \theta_{3} \cdot \sigma'(z_{2}))(\theta_{2i}) \)
\( \frac{\partial L}{\partial z_{1i}} = \frac{\partial L}{\partial a_{1i}} \frac{\partial a_{1i}}{\partial z_{1i}} = ((a_3 - y) \cdot \theta_{3} \cdot \sigma'(z_{2}) \cdot \theta_{2i})(\sigma'(z_{1i})) \)
\( \frac{\partial L}{\partial \theta_{1ij}} = \frac{\partial L}{\partial z_{1i}} \frac{\partial z_{1i}}{\partial \theta_{1ij}} = ((a_3 - y) \cdot \theta_{3} \cdot \sigma'(z_{2}) \cdot \theta_{2i} \cdot \sigma'(z_{1i}))(x_{j}) \)
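To make these equations concrete, here is a NumPy sketch of the same three-layer network, assuming two sigmoid hidden layers, a linear output layer, and the squared-error loss above; the layer sizes and the random input, target, and weights are arbitrary. Each line of the backward pass mirrors one line above.

```python
# Forward and backward pass for the MLP described above, as a sketch.
# Layer sizes, random weights, and the input/target values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=4)          # input
y  = rng.normal(size=2)          # target
t1 = rng.normal(size=(3, 4))     # theta_1
t2 = rng.normal(size=(3, 3))     # theta_2
t3 = rng.normal(size=(2, 3))     # theta_3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass, storing every intermediate value for the backward pass.
z1 = t1 @ x;  a1 = sigmoid(z1)
z2 = t2 @ a1; a2 = sigmoid(z2)
a3 = t3 @ a2                      # linear output layer
L  = 0.5 * np.sum((y - a3) ** 2)

# Backward pass: each line mirrors one of the equations above.
dL_da3 = a3 - y                               # dL/da_3
dL_dt3 = np.outer(dL_da3, a2)                 # dL/dtheta_3 = (a_3 - y) a_2^T
dL_da2 = t3.T @ dL_da3                        # dL/da_2
dL_dz2 = dL_da2 * a2 * (1.0 - a2)             # dL/dz_2 = dL/da_2 * sigma'(z_2)
dL_dt2 = np.outer(dL_dz2, a1)                 # dL/dtheta_2
dL_da1 = t2.T @ dL_dz2                        # dL/da_1
dL_dz1 = dL_da1 * a1 * (1.0 - a1)             # dL/dz_1
dL_dt1 = np.outer(dL_dz1, x)                  # dL/dtheta_1

print(dL_dt1.shape, dL_dt2.shape, dL_dt3.shape)
```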
When attempting to learn a finite set of labels, a Classifier with one-hot encoding can be used.
One Hot Encoding
Each label is encoded as a vector of zeros with a single 1, where the index of the 1 is the label of the image.
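A small sketch of this encoding; the function name and the number of classes are just illustrative:

```python
# One-hot encoding: label 3 out of 5 classes becomes [0, 0, 0, 1, 0].
def one_hot(label, num_classes):
    vec = [0] * num_classes
    vec[label] = 1
    return vec

print(one_hot(3, 5))   # [0, 0, 0, 1, 0]
```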
This uses
Cross Entropy Loss
\( L(p) = -\sum_i y_i \log(p_i) \), where \( y \) is the one-hot label vector and \( p \) is the vector of predicted probabilities.
This rewards the model for assigning a high probability to the correct class, but does not directly penalize it for the probabilities it assigns to the wrong ones, since only the term for the correct class is non-zero.
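A small sketch of the loss under the formula above, with made-up probabilities, showing that only the correct class's term contributes:

```python
# Cross entropy with a one-hot label: only the correct class's term survives.
import math

y = [0, 0, 1, 0]                 # one-hot label, correct class is index 2
p = [0.1, 0.2, 0.6, 0.1]         # predicted probabilities (made up)

loss = -sum(yi * math.log(pi) for yi, pi in zip(y, p))
print(loss)                      # == -log(0.6); the other classes do not appear
```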
The final layer of the classifier is a
Softmax Activation
\( \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} \), which elevates the highest value and suppresses the others, ensuring the probabilities sum to 1. This ensures the model does not simply output high estimates for every class.
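A minimal sketch of the softmax with arbitrary input values:

```python
# Softmax: exponentiate and normalize so the outputs sum to 1.
# (A numerically stable version would subtract max(z) before exponentiating.)
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))        # largest input gets most of the probability
print(sum(softmax([2.0, 1.0, 0.1])))   # 1.0
```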
When these two are combined, they yield a simpler equation for the final two layers of the classifier:
\( \frac{\partial L}{\partial p_i} = \sigma(p_i) - y_i \)
where \( p \) is the vector of pre-softmax values (logits) and \( \sigma(p_i) \) denotes the \( i \)-th output of the softmax.
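This combined gradient can be checked numerically. The sketch below compares \( \sigma(p) - y \) against a finite-difference estimate of the cross-entropy-of-softmax loss; the logits, target, and step size are arbitrary choices.

```python
# Check that d(cross_entropy(softmax(p)))/dp_i == softmax(p)_i - y_i.
import math

def softmax(z):
    m = max(z)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(p, y):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(p)))

p = [1.5, -0.3, 0.8]
y = [0, 1, 0]                                   # one-hot target

analytic = [si - yi for si, yi in zip(softmax(p), y)]

eps = 1e-6
numeric = []
for i in range(len(p)):
    bumped = list(p)
    bumped[i] += eps
    numeric.append((loss(bumped, y) - loss(p, y)) / eps)

print(analytic)
print(numeric)   # matches the analytic gradient to within ~1e-5
```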
Here is what this network looks like:
Forward Pass
\( \frac{\partial L}{\partial p_i} = \sigma(p_i) - y_i \)
\( \frac{\partial L}{\partial c_i} = \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial c_i} = (\sigma(p_i) - y_i) \left( \frac{1}{2} \right) \)
\( \frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial c} \frac{\partial c}{\partial \theta_i} = \left( \frac{\sigma(p) - y}{2} \right) ( x_{i-1} + x_i + x_{i+1} ) \)
\( \frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial c} \frac{\partial c}{\partial x_i} = \left( \frac{\sigma(p) - y}{2} \right) ( \theta ) \)