Using Taylor expansions to increase the speed of convergence in gradient descent
Problem
Consider a hypothetical one-dimensional neural network whose loss function is a quadratic function of the weight:
\[\Large
L : \mathbb{R} \rightarrow \mathbb{R}
\]
And the learning update:
\[\Large
w_{t+1} = w_t - \eta \frac{dL}{dw}(w_t)
\]
Assuming that there is a single global extremum
\( L(w_*) \), how do you determine the optimal learning rate
\[ \Large \eta_* \quad \text{such that} \quad w_t - \eta_* \frac{dL}{dw}(w_t) = w_* \]
that allows you to reach the global extremum in one step?
Solution
The loss function is quadratic, so any derivative beyond the second is zero.
This means that the Taylor expansion of the derivative of the loss function around \( w_t \), evaluated at \( w_* \), can be written exactly with only two terms:
\[\Large
\frac{dL}{dw}(w_*) = \frac{dL}{dw}(w_t) + \frac{d^2L}{dw^2}(w_t)(w_* - w_t)
\]
\( w_* \) is a
unique extremum of a smooth curve, so the first derivative at that point is zero:
\[\Large
\frac{dL}{dw}(w_*) = 0
\]
Substituting this into the
Taylor expansion yields:
\[\Large
0 = \frac{dL}{dw}(w_t) + \frac{d^2L}{dw^2}(w_t)(w_* - w_t)
\]
and solving for
\( w_* \):
\[\Large
w_* = w_t -\frac{\frac{dL}{dw}(w_t)}{\frac{d^2L}{dw^2}(w_t)}
\]
which can be written in terms of the original learning step:
\[\Large
w_* = w_t - \frac{1}{\frac{d^2L}{dw^2}(w_t)} \frac{dL}{dw}(w_t)
\]
\[\Large
w_* = w_t - \eta \frac{dL}{dw}(w_t)
\]
\[\Large
\eta = \frac{1}{\frac{d^2L}{dw^2}(w_t)}
\]
It can be assumed that
\( \frac{d^2L}{dw^2}(w_t) \) is not zero. If it were, the loss function would be either linear (and therefore monotonic) or constant, and would not have a global extremum.
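As a concrete illustration, here is a minimal Python sketch of the one-step update for a quadratic loss \( L(w) = aw^2 + bw + c \) with \( a \neq 0 \); the function and parameter names are illustrative, not from any particular library.
```python
# One-step convergence for the quadratic loss L(w) = a*w**2 + b*w + c, with a != 0.
# dL/dw(w) = 2*a*w + b and d2L/dw2(w) = 2*a, so the optimal rate is eta = 1/(2*a).

def optimal_step(w_t, a, b):
    grad = 2 * a * w_t + b      # dL/dw at w_t
    curvature = 2 * a           # d2L/dw2, constant for a quadratic
    eta = 1.0 / curvature       # optimal learning rate
    return w_t - eta * grad     # lands exactly on w_* = -b / (2*a)

w_star = optimal_step(w_t=10.0, a=1.0, b=-4.0)   # 2.0, the extremum, reached in one step
```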
If \( L \) is instead a scalar field over a vector space:
\[\Large L: \mathbb{R}^n \rightarrow \mathbb{R} \]
the math is slightly more complicated, but the same principle applies.
The Taylor expansion is again truncated after the second term, under the assumption that each term of the loss function is at most quadratic in the weights, so all derivatives beyond the second are zero.
\[\Large
\nabla L (w_*) = \nabla L(w_t) + H \big[L(w_t) \big] (w_* - w_t)
\]
where
\( H \) is the
Hessian of the loss function.
As in the one-dimensional case, the gradient vanishes at the global extremum:
\[\Large \nabla L(w_*) = \overrightarrow{0} \]
and substituting this into the
Taylor expansion:
\[\Large
\overrightarrow{0} = \nabla L(w_t) + H \big[L(w_t) \big] (w_* - w_t)
\]
\[\Large
w_* = w_t - H^{-1} \big[L(w_t) \big] \nabla L(w_t)
\]
\[\Large
\eta = H^{-1} \big[L(w_t) \big]
\]
If any of the eigenvalues of
\( H \) are 0, the second directional derivative along the corresponding eigenvector is 0. This means that the loss function is either constant or monotonic in that direction, and does not have a global extremum.
Therefore \( H \) must have a full set of non-zero eigenvalues, and is invertible.
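As a sketch of the vector case, the following assumes a quadratic loss \( L(w) = \tfrac{1}{2} w^\top A w + b^\top w + c \) with a symmetric, invertible Hessian \( A \); the names and values here are illustrative.
```python
import numpy as np

# Quadratic loss L(w) = 0.5 * w^T A w + b^T w + c.
# grad L(w) = A w + b and H[L(w)] = A, so w_* = w_t - A^{-1} (A w_t + b).

def newton_step(w_t, A, b):
    grad = A @ w_t + b               # gradient at w_t
    step = np.linalg.solve(A, grad)  # H^{-1} grad, without forming the inverse explicitly
    return w_t - step

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # illustrative symmetric Hessian with non-zero eigenvalues
b = np.array([-1.0, 2.0])
w_t = np.array([5.0, -5.0])
w_star = newton_step(w_t, A, b)      # satisfies A w_* + b = 0, reached in one step
```
Solving the linear system rather than explicitly inverting \( H \) is the usual numerical choice, since it is cheaper and more stable.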
Interactive Demonstration
f(w) = -1.0w² - 2.0w - 3.0
f'(w) = -2.0w - 2.0
f''(w) = -2.0
η = 1/f''(w₀) = -0.5
w₀ = 4.0, f(w₀) = -27.0, f'(w₀) = -10.0
w₁ = w₀ - η·f'(w₀) = -1.0, f(w₁) = -2.0
w* = -1.0, f(w*) = -2.0
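A short Python check of the values above (names are illustrative):
```python
# f(w) = -w**2 - 2*w - 3, with f'(w) = -2*w - 2 and f''(w) = -2.

def f(w):
    return -w**2 - 2*w - 3

def one_step(w0):
    grad = -2 * w0 - 2       # f'(w0)
    eta = 1.0 / -2.0         # 1 / f''(w0)
    return w0 - eta * grad

w1 = one_step(4.0)
print(w1, f(w1))             # -1.0 -2.0, the extremum
```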
Takeaways
- As the magnitude of the derivatives beyond the second increases, the single-step estimate becomes less accurate
- Loss functions are rarely purely second order, but incorporating higher-order derivatives into gradient descent can speed up convergence
- The cost of calculating second-order derivatives grows quadratically with the dimension of the weight space, since the Hessian has \( n^2 \) entries
- Computing the Hessian and its inverse can be more expensive than taking multiple steps of first-order gradient descent (see the sketch below)
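To make the last two points concrete, here is a rough numpy sketch (sizes and names are illustrative) of the work in one Hessian-based step versus one first-order step:
```python
import numpy as np

n = 2000                                   # illustrative number of weights
rng = np.random.default_rng(0)
H = np.eye(n) * n + rng.standard_normal((n, n)) * 0.1
H = (H + H.T) / 2                          # symmetric stand-in for an n x n Hessian
g = rng.standard_normal(n)                 # stand-in for a gradient
w = rng.standard_normal(n)

# First-order step: O(n) arithmetic once the gradient is known.
w_gd = w - 0.01 * g

# Hessian-based step: the Hessian alone stores n^2 numbers,
# and solving H x = g costs O(n^3) arithmetic.
w_newton = w - np.linalg.solve(H, g)
```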