Using Taylor expansions to increase the speed of convergence in gradient descent
Problem
Consider a hypothetical one-dimensional neural network whose loss function is a quadratic function of the weight:
\[\Large
L : \mathbb{R} \rightarrow \mathbb{R}
\]
And the learning update:
\[\Large
w_{t+1} = w_t - \eta \frac{dL}{dw}(w_t)
\]
Assuming that there is a single global extremum
\( L(w_*) \), how do you determine the optimal learning rate
\[ \Large \eta_* \quad \text{such that} \quad w_t - \eta_* \frac{dL}{dw}(w_t) = w_* \]
that allows you to reach the global extremum in one step?
Solution
The loss function is quadratic, so any derivative beyond the second is zero.
This means that the Taylor expansion of the derivative of the loss function around \( w_t \), evaluated at \( w_* \), can be written exactly with only two terms:
\[\Large
\frac{dL}{dw}(w_*) = \frac{dL}{dw}(w_t) + \frac{d^2L}{dw^2}(w_t)(w_* - w_t)
\]
\( w_* \) is a
unique extremum of a smooth curve, so the first derivative at that point is zero:
\[\Large
\frac{dL}{dw}(w_*) = 0
\]
Substituting this into the
Taylor expansion yields:
\[\Large
0 = \frac{dL}{dw}(w_t) + \frac{d^2L}{dw^2}(w_t)(w_* - w_t)
\]
and solving for
\( w_* \):
\[\Large
w_* = w_t -\frac{\frac{dL}{dw}(w_t)}{\frac{d^2L}{dw^2}(w_t)}
\]
which can be written in terms of the original learning step:
\[\Large
w_* = w_t - \frac{1}{\frac{d^2L}{dw^2}(w_t)} \frac{dL}{dw}(w_t)
\]
\[\Large
w_* = w_t - \eta \frac{dL}{dw}(w_t)
\]
\[\Large
\eta = \frac{1}{\frac{d^2L}{dw^2}(w_t)}
\]
It can be assumed that
\( \frac{d^2L}{dw^2}(w_t) \) is not zero. If it were, the loss function would be either linear (and therefore monotonic) or constant, and would not have a global extremum.
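As a concrete illustration, here is a minimal Python sketch of the one-step update for a quadratic loss \( L(w) = aw^2 + bw + c \) with \( a \neq 0 \); the function and parameter names are illustrative, not from any particular library.
```python
# One-step convergence for the quadratic loss L(w) = a*w**2 + b*w + c, with a != 0.
# dL/dw(w) = 2*a*w + b and d2L/dw2(w) = 2*a, so the optimal rate is eta = 1/(2*a).

def optimal_step(w_t, a, b):
    grad = 2 * a * w_t + b      # dL/dw at w_t
    curvature = 2 * a           # d2L/dw2, constant for a quadratic
    eta = 1.0 / curvature       # optimal learning rate
    return w_t - eta * grad     # lands exactly on w_* = -b / (2*a)

w_star = optimal_step(w_t=10.0, a=1.0, b=-4.0)   # 2.0, the extremum, reached in one step
```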
If \( L \) is instead a scalar field over a vector space:
\[\Large L: \mathbb{R}^n \rightarrow \mathbb{R} \]
the math is slightly more complicated, but the same principle applies.
The Taylor expansion is again truncated after the second term, under the assumption that each term of the loss function is at most quadratic in the weights, so all derivatives beyond the second are zero.
\[\Large
\nabla L (w_*) = \nabla L(w_t) + H \big[L(w_t) \big] (w_* - w_t)
\]
where
\( H \) is the
Hessian of the loss function.
As in the one-dimensional case, the gradient vanishes at the global extremum:
\[\Large \nabla L(w_*) = \overrightarrow{0} \]
and substituting this into the
Taylor expansion:
\[\Large
\overrightarrow{0} = \nabla L(w_t) + H \big[L(w_t) \big] (w_* - w_t)
\]
\[\Large
w_* = w_t - H^{-1} \big[L(w_t) \big] \nabla L(w_t)
\]
\[\Large
\eta = H^{-1} \big[L(w_t) \big]
\]
If any of the eigenvalues of
\( H \) are 0, the second directional derivative along the corresponding eigenvector is 0. This means that the loss function is either constant or monotonic in that direction, and does not have a global extremum.
Therefore \( H \) must have a full set of non-zero eigenvalues, and is invertible.
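As a sketch of the vector case, the following assumes a quadratic loss \( L(w) = \tfrac{1}{2} w^\top A w + b^\top w + c \) with a symmetric, invertible Hessian \( A \); the names and values here are illustrative.
```python
import numpy as np

# Quadratic loss L(w) = 0.5 * w^T A w + b^T w + c.
# grad L(w) = A w + b and H[L(w)] = A, so w_* = w_t - A^{-1} (A w_t + b).

def newton_step(w_t, A, b):
    grad = A @ w_t + b               # gradient at w_t
    step = np.linalg.solve(A, grad)  # H^{-1} grad, without forming the inverse explicitly
    return w_t - step

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # illustrative symmetric Hessian with non-zero eigenvalues
b = np.array([-1.0, 2.0])
w_t = np.array([5.0, -5.0])
w_star = newton_step(w_t, A, b)      # satisfies A w_* + b = 0, reached in one step
```
Solving the linear system rather than explicitly inverting \( H \) is the usual numerical choice, since it is cheaper and more stable.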
Interactive Demonstration
f(w) = -1.0w² - 2.0w - 3.0
f'(w) = -2.0w - 2.0
f''(w) = -2.0
η = 1/f''(w₀) = -0.5
w₀ = 4.0, f(w₀) = -27.0, f'(w₀) = -10.0
w₁ = w₀ - η·f'(w₀) = -1.0, f(w₁) = -2.0
w* = -1.0, f(w*) = -2.0
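A short Python check of the values above (names are illustrative):
```python
# f(w) = -w**2 - 2*w - 3, with f'(w) = -2*w - 2 and f''(w) = -2.

def f(w):
    return -w**2 - 2*w - 3

def one_step(w0):
    grad = -2 * w0 - 2       # f'(w0)
    eta = 1.0 / -2.0         # 1 / f''(w0)
    return w0 - eta * grad

w1 = one_step(4.0)
print(w1, f(w1))             # -1.0 -2.0, the extremum
```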
Takeaways
- As the magnitude of the derivatives beyond the second increases, the single-step estimate becomes less accurate
- Loss functions are rarely purely second order, but incorporating higher-order derivatives into gradient descent can speed up convergence
- The cost of calculating second-order derivatives grows quadratically with the dimension of the weight space, since the Hessian has \( n^2 \) entries
- Computing the Hessian and its inverse can be more expensive than taking multiple steps of first-order gradient descent (see the sketch below)
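To make the last two points concrete, here is a rough numpy sketch (sizes and names are illustrative) of the work in one Hessian-based step versus one first-order step:
```python
import numpy as np

n = 2000                                   # illustrative number of weights
rng = np.random.default_rng(0)
H = np.eye(n) * n + rng.standard_normal((n, n)) * 0.1
H = (H + H.T) / 2                          # symmetric stand-in for an n x n Hessian
g = rng.standard_normal(n)                 # stand-in for a gradient
w = rng.standard_normal(n)

# First-order step: O(n) arithmetic once the gradient is known.
w_gd = w - 0.01 * g

# Hessian-based step: the Hessian alone stores n^2 numbers,
# and solving H x = g costs O(n^3) arithmetic.
w_newton = w - np.linalg.solve(H, g)
```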