Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Hristo Papazov; Scott Pesme; Nicolas Flammarion

TL;DR

This work uses a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows it to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule.

Abstract

In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
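As a small illustration of the intrinsic quantity, the sketch below computes $\lambda = \gamma / (1 - \beta)^2$ and, for a new step size, solves for the momentum parameter that keeps $\lambda$ unchanged. The function names `intrinsic_lambda` and `matching_beta` are hypothetical helpers introduced here, and the restriction $0 \le \beta < 1$ is an assumption of this sketch rather than something stated in the abstract.

```python
import math


def intrinsic_lambda(gamma: float, beta: float) -> float:
    """Return lambda = gamma / (1 - beta)^2 for step size gamma and momentum beta."""
    # Assumption of this sketch: beta lies in [0, 1).
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta is assumed to lie in [0, 1)")
    return gamma / (1.0 - beta) ** 2


def matching_beta(lam: float, new_gamma: float) -> float:
    """Return beta' such that (new_gamma, beta') has the same lambda.

    Solves new_gamma / (1 - beta')^2 = lam for beta'.
    """
    beta = 1.0 - math.sqrt(new_gamma / lam)
    if beta < 0.0:
        raise ValueError("new_gamma is too large to keep lambda fixed with beta >= 0")
    return beta


lam = intrinsic_lambda(gamma=1e-3, beta=0.9)
beta_fast = matching_beta(lam, new_gamma=4e-3)
# Both parameter pairs share the same lambda (up to floating-point rounding).
same_lambda = math.isclose(intrinsic_lambda(4e-3, beta_fast), lam, rel_tol=1e-9)
```

Under the abstract's claim, two runs whose $(\gamma, \beta)$ pairs share the same $\lambda$ would trace the same optimisation path, differing only in how quickly they move along it.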

Paper Structure (40 sections, 23 theorems, 197 equations, 9 figures)


Introduction
Main Contributions
Related Works
From Discrete to Continuous
Momentum Gradient Flow over Diagonal Linear Networks
Implicit Bias of Gradient Flow
Implicit Bias of Momentum Gradient Flow
General Characterisation of MGF Bias
Provable Benefits of Momentum for Small Values of $\lambda$
Sketch of Proof
Momentum SGD over Diagonal Linear Networks
General Characterisation of SMGD Bias
Conclusion
Additional Notations and Comments on Discretisation Methods
$(w_+, w_-)$-Reparametrisation
...and 25 more sections

Key Result

Proposition 1

For $(w_0, w_1) \in \mathbb{R}^{2d}$, consider the momentum gradient flow \ref{mgf:lambda} with $\lambda = \gamma / (1 - \beta)^2$ and initialisation $w_{t = 0} = w_0$, $\dot{w}_{t=0} = (w_1 - w_0)/\sqrt{\lambda \gamma}$. Then, discretising as in \ref{mgf:discretisation} with discretisation step $\varepsilon = \sqrt{\lambda \gamma} = \gamma / (1 - \beta)$ leads to the momentum gradient descent recursion \ref{mgd} with step size $\gamma$,
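The flow and the discretisation scheme are only referenced by label in this summary, so the following worked derivation rests on two assumptions: that the momentum gradient flow has the heavy-ball form $\lambda \ddot{w}_t + \dot{w}_t + \nabla L(w_t) = 0$ for a loss $L$, and that the discretisation uses a central difference for $\ddot{w}$ and a backward difference for $\dot{w}$. Under those assumptions, the step $\varepsilon = \sqrt{\lambda \gamma}$ recovers a recursion with step size $\gamma$ and momentum $\beta$:

```latex
% Assumed flow (not shown in this summary):
%   \lambda \ddot{w}_t + \dot{w}_t + \nabla L(w_t) = 0
\begin{align*}
\lambda \, \frac{w_{k+1} - 2 w_k + w_{k-1}}{\varepsilon^2}
  + \frac{w_k - w_{k-1}}{\varepsilon} + \nabla L(w_k) &= 0 \\
\Longrightarrow \quad
w_{k+1} &= w_k - \frac{\varepsilon^2}{\lambda} \nabla L(w_k)
  + \Big( 1 - \frac{\varepsilon}{\lambda} \Big) (w_k - w_{k-1}) \\
\text{with } \varepsilon = \sqrt{\lambda \gamma}: \quad
\frac{\varepsilon^2}{\lambda} &= \gamma, \qquad
1 - \frac{\varepsilon}{\lambda} = 1 - \sqrt{\gamma / \lambda}
  = 1 - (1 - \beta) = \beta
\end{align*}
```

The initial velocity $\dot{w}_{t=0} = (w_1 - w_0)/\varepsilon$ is then simply the forward difference between the first two iterates, which is consistent with the initialisation in the proposition.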

Figures (9)

Figure 1: (M)GD over a $2$D quadratic. Left and Middle: The (M)GD trajectories closely follow the continuous trajectories of (M)GF, as suggested by \ref{prop:discretisation}. Right: MGD$(4 \gamma, \beta^2)$ follows the same trajectory as MGD$(\gamma, \beta)$ but twice as fast, as suggested by \ref{cor:speedup}. In contrast, GD$(4 \gamma)$ runs four times faster than GD$(\gamma)$.
Figure 2: Teacher-student framework with a fully-connected $1$-hidden layer ReLU network. The level lines of the test loss after training with \ref{mgd} correspond to values of $\gamma, \beta$ which have a fixed value $\lambda = \gamma / (1 - \beta)^2$, as predicted by \ref{prop:discretisation}.
Figure 3: Test loss (in blue) and magnitude of balancedness (in red) at convergence of \ref{mgf:lambda} over a diagonal linear network in a sparse regression setting with uncentered data. As predicted by \ref{main_mgf:general}, a more balanced solution generalises better. The shaded zone corresponds to values of $\lambda$ for which the balancedness never hits zero during training and for which \ref{main_mgf} therefore holds.
Figure 4: (Non-stochastic) MGD over a diagonal linear network in a sparse regression setting with uncentered data. As predicted by \ref{prop:discretisation}, the three quantities at convergence only depend on the single parameter $\lambda \coloneqq \gamma / (1 - \beta)^2$. As predicted by \ref{main_mgd:general}, a more balanced solution (center plot) leads to a solution with a smaller $\ell_1$-norm (right plot), which in turn translates into better generalisation (left plot). Finally, as predicted by \ref{main_mgd}, the trajectories for which the iterates do not cross zero satisfy $\Delta_\infty < \Delta_0$, where $\Delta_0$ (approximately) corresponds to the asymptotic balancedness for $\beta = 0$ and $\gamma = 10^{-3}$.
Figure 5: A visualisation of the areas over which we integrate $( \frac{\dot{w}_s}{w_s} )^2 e^{-\frac{t-s}{\lambda}} \mathop{\mathrm{sgn}}\nolimits(w_t w_s)$ in the above limit.
...and 4 more figures
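The behaviour described in the Figure 1 caption can be mimicked with a minimal sketch, which is not the authors' experiment. It assumes the standard heavy-ball form of the recursion, $w_{k+1} = w_k - \gamma \nabla L(w_k) + \beta (w_k - w_{k-1})$, a hypothetical diagonal quadratic with curvatures $(1, 5)$, and zero initial velocity ($w_1 = w_0$). The second run is chosen so that $\lambda$ is exactly equal, which gives $\beta' = 2\beta - 1$; this differs from the $\beta^2$ of the caption by $(1 - \beta)^2$, so the two choices are close for $\beta$ near $1$.

```python
def mgd(grad, w0, w1, gamma, beta, steps):
    """Heavy-ball recursion (assumed form): w_{k+1} = w_k - gamma*grad(w_k) + beta*(w_k - w_{k-1}).

    Returns the list [w_0, w_1, ..., w_steps].
    """
    traj = [list(w0), list(w1)]
    for _ in range(steps - 1):
        prev, cur = traj[-2], traj[-1]
        g = grad(cur)
        traj.append([c - gamma * gi + beta * (c - p) for c, gi, p in zip(cur, g, prev)])
    return traj


# Hypothetical 2D quadratic L(w) = 0.5 * (1.0 * w[0]**2 + 5.0 * w[1]**2).
CURVATURES = (1.0, 5.0)


def quad_grad(w):
    return [a * x for a, x in zip(CURVATURES, w)]


gamma, beta = 1e-4, 0.9
w_init = [1.0, 1.0]

# Both runs share lambda = gamma / (1 - beta)^2 = 0.01 (up to rounding).
slow = mgd(quad_grad, w_init, w_init, gamma, beta, steps=1000)
fast = mgd(quad_grad, w_init, w_init, 4 * gamma, 2 * beta - 1, steps=500)

# Compare iterate k of the fast run with iterate 2k of the slow run.
max_gap = max(
    abs(s - f)
    for k in range(501)
    for s, f in zip(slow[2 * k], fast[k])
)

# At the same iteration count, the fast run is further along the path.
same_index_gap = abs(slow[500][0] - fast[500][0])
```

In this setup `max_gap` stays small compared with the distance travelled, while `same_index_gap` is large: the fast run reaches in $k$ iterations roughly the point the slow run reaches in $2k$, in line with the discretisation step $\varepsilon = \gamma / (1 - \beta)$ doubling.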

Theorems & Definitions (32)

Proposition 1
Corollary 1: Acceleration rule
Proposition 2: alvarez_convex_heavy_ball
Definition: Balancedness
Lemma 1
Theorem 1
Corollary 2
Proposition 3
Proposition 4
Lemma 2
...and 22 more
