Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Hristo Papazov, Scott Pesme, Nicolas Flammarion

TL;DR

This work analyses momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ in continuous time. The analysis identifies an intrinsic quantity $\lambda = \frac{\gamma}{(1 - \beta)^2}$ which uniquely defines the optimisation path and provides a simple acceleration rule.

Abstract

In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{\gamma}{(1 - \beta)^2}$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
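As a small illustration of the intrinsic quantity, the sketch below evaluates $\lambda = \gamma / (1 - \beta)^2$ and the discretisation step $\varepsilon = \gamma / (1 - \beta)$ quoted in Proposition 1 for a few hyperparameter pairs. The function names and the numerical values are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def intrinsic_lambda(gamma, beta):
    """lambda = gamma / (1 - beta)^2, the intrinsic quantity from the TL;DR."""
    return gamma / (1.0 - beta) ** 2


def discretisation_step(gamma, beta):
    """epsilon = gamma / (1 - beta), the step size quoted in Proposition 1."""
    return gamma / (1.0 - beta)


# Illustrative (gamma, beta) pairs; the first three share lambda = 1,
# the last one is plain gradient descent (beta = 0), where lambda = gamma.
pairs = [(0.01, 0.9), (0.04, 0.8), (0.0025, 0.95), (0.01, 0.0)]

for gamma, beta in pairs:
    lam = intrinsic_lambda(gamma, beta)
    eps = discretisation_step(gamma, beta)
    print(f"gamma={gamma:<7} beta={beta:<5} lambda={lam:.4f} epsilon={eps:.4f}")
```

Pairs with the same $\lambda$ but different $\varepsilon$ are the ones that, according to the summary above, trace the same optimisation path at different speeds.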

Paper Structure (40 sections, 23 theorems, 197 equations, 9 figures)

Key Result

Proposition 1

For $(w_0, w_1) \in \mathbb{R}^{2d}$, consider the momentum gradient flow \ref{mgf:lambda} with $\lambda = \gamma / (1 - \beta)^2$ and initialisation $w_{t = 0} = w_0$, $\dot{w}_{t=0} = (w_1 - w_0)/\sqrt{\lambda \gamma}$. Then, discretising as in \ref{mgf:discretisation} with discretisation step $\varepsilon = \sqrt{\lambda \gamma} = \gamma / (1 - \beta)$ leads to the momentum gradient descent recursion \ref{mgd} with step size $\gamma$,
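The computation behind this statement can be sketched as follows. The equations referenced above are not reproduced in this summary, so the sketch assumes that the momentum gradient flow is $\lambda \ddot{w}_t + \dot{w}_t + \nabla F(w_t) = 0$, that momentum gradient descent is the heavy-ball recursion $w_{k+1} = w_k - \gamma \nabla F(w_k) + \beta (w_k - w_{k-1})$, and that the discretisation uses a centred second difference and a backward first difference. The loss symbol $F$ is likewise an assumed notation.

```latex
% Assumed finite-difference discretisation of  \lambda \ddot{w} + \dot{w} + \nabla F(w) = 0
\lambda \, \frac{w_{k+1} - 2 w_k + w_{k-1}}{\varepsilon^2}
  + \frac{w_k - w_{k-1}}{\varepsilon}
  + \nabla F(w_k) = 0
% Multiply by \varepsilon^2 / \lambda and solve for w_{k+1}:
\;\Longrightarrow\;
w_{k+1} = w_k
  - \frac{\varepsilon^2}{\lambda} \nabla F(w_k)
  + \Bigl(1 - \frac{\varepsilon}{\lambda}\Bigr) (w_k - w_{k-1})
% Matching with the heavy-ball recursion:
\gamma = \frac{\varepsilon^2}{\lambda},
\qquad
\beta = 1 - \frac{\varepsilon}{\lambda}
\;\Longrightarrow\;
\varepsilon = \sqrt{\lambda \gamma} = \frac{\gamma}{1 - \beta},
\qquad
\lambda = \frac{\gamma}{(1 - \beta)^2}
```

Under the same assumptions, the initial velocity $\dot{w}_{t=0} = (w_1 - w_0)/\sqrt{\lambda \gamma}$ is the forward difference $(w_1 - w_0)/\varepsilon$ between the first two iterates.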

Figures (9)

  • Figure 1: (M)GD over a $2$D quadratic. Left and Middle: The (M)GD trajectories closely follow the continuous trajectories of (M)GF as suggested by \ref{prop:discretisation}. Right: MGD$(4 \gamma, \beta^2)$ follows the same trajectory as MGD$(\gamma, \beta)$ but twice as fast as suggested by \ref{cor:speedup}. In contrast, GD$(4 \gamma)$ runs four times faster than GD$(\gamma)$.
  • Figure 2: Teacher-student framework with a fully-connected $1$-hidden layer ReLU network. The level lines of the test loss after training with \ref{mgd} correspond to values of $\gamma, \beta$ which have a fixed value $\lambda = \gamma / (1 - \beta)^2$, as predicted by \ref{prop:discretisation}.
  • Figure 3: Test loss (in blue) and magnitude of balancedness (in red) at convergence of \ref{mgf:lambda} over a diagonal linear network in a sparse regression setting with uncentered data. As predicted by \ref{main_mgf:general}, a more balanced solution generalises better. The shaded zone corresponds to values of $\lambda$ for which the balancedness never hits zero during training and for which \ref{main_mgf} therefore holds.
  • Figure 4: (Non-stochastic) MGD over a diagonal linear network in a sparse regression setting with uncentered data. As predicted by \ref{prop:discretisation}, the three quantities at convergence only depend on the single parameter $\lambda \coloneqq \gamma / (1 - \beta)^2$. As predicted by \ref{main_mgd:general}, a more balanced solution (center plot) leads to a solution with a smaller $\ell_1$-norm (right plot), which in turn translates into better generalisation (left plot). Finally, as predicted by \ref{main_mgd}, the trajectories for which the iterates do not cross zero satisfy $\Delta_\infty < \Delta_0$, where $\Delta_0$ (approximately) corresponds to the asymptotic balancedness for $\beta = 0$ and $\gamma = 10^{-3}$.
  • Figure 5: A visualisation of the areas over which we integrate $\left( \frac{\dot{w}_s}{w_s} \right)^2 e^{-\frac{t-s}{\lambda}} \operatorname{sgn}(w_t w_s)$ in the above limit.
  • ...and 4 more figures
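
The speed-up shown in the right panel of Figure 1 can be reproduced in spirit with a short simulation. This is a minimal sketch under stated assumptions: MGD is taken to be the heavy-ball recursion $w_{k+1} = w_k - \gamma \nabla F(w_k) + \beta (w_k - w_{k-1})$, the quadratic and its curvatures are hypothetical, and both runs start with $w_0 = w_1$. Instead of the caption's pair $(4\gamma, \beta^2)$, the sketch uses $(4\gamma, 2\beta - 1)$, which keeps $\lambda = \gamma/(1-\beta)^2$ exactly fixed while doubling $\varepsilon = \gamma/(1-\beta)$; the two choices of momentum differ by $(1 - \beta)^2$.

```python
CURVATURES = (1.0, 0.25)  # hypothetical 2D quadratic F(w) = 0.5 * sum(a_i * w_i**2)


def grad(w):
    """Gradient of the illustrative quadratic."""
    return [a * wi for a, wi in zip(CURVATURES, w)]


def mgd(gamma, beta, w0, n_steps):
    """Heavy-ball recursion (assumed form of MGD); returns [w_0, ..., w_{n_steps}]."""
    prev, cur = list(w0), list(w0)  # w_0 = w_1, i.e. zero initial velocity
    path = [prev, cur]
    for _ in range(n_steps - 1):
        g = grad(cur)
        nxt = [c - gamma * gi + beta * (c - p) for c, p, gi in zip(cur, prev, g)]
        prev, cur = cur, nxt
        path.append(cur)
    return path


gamma, beta, rho = 0.01, 0.9, 2
gamma_fast = rho**2 * gamma        # 4 * gamma
beta_fast = 1 - rho * (1 - beta)   # 2 * beta - 1, keeps lambda unchanged

slow = mgd(gamma, beta, [1.0, 1.0], 1000)
fast = mgd(gamma_fast, beta_fast, [1.0, 1.0], 500)

# Compare iterate k of the fast run with iterate rho * k of the slow run.
max_gap = max(
    abs(f - s)
    for k in range(len(fast))
    for f, s in zip(fast[k], slow[rho * k])
)
print("largest coordinate-wise gap between matched iterates:", max_gap)
```

Both runs are discretisations of the same continuous trajectory with different step sizes, so the matched iterates agree only approximately; the gap should shrink as $\gamma$ is reduced with $\lambda$ held fixed.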

Theorems & Definitions (32)

  • Proposition 1
  • Corollary 1: Acceleration rule
  • Proposition 2: alvarez_convex_heavy_ball
  • Definition: Balancedness
  • Lemma 1
  • Theorem 1
  • Corollary 2
  • Proposition 3
  • Proposition 4
  • Lemma 2
  • ...and 22 more