Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks

Pierfrancesco Beneventano; Blake Woodworth

TL;DR

The paper analyzes gradient descent (GD) on a depth-2 linear network, parameterized as m = a^T b, for the univariate regression loss L(a,b) = 1/2 (a^T b - Φ)^2, and establishes a linear convergence rate to a global minimum under explicit step-size conditions. It introduces three key quantities, ε (residual), λ (norm), and Q (imbalance), and shows that GD reduces the imbalance while gradient flow (GF) conserves it, so GD converges to flatter minima than GF. The authors identify a trade-off: larger step sizes strengthen implicit regularization (flatter solutions) but can slow convergence, especially near the edge of stability, where the dynamics may linger while still reducing the loss. These results show how discretization and implicit bias shape optimization in simple models, with possible implications for training dynamics and generalization in more complex networks. Overall, the work gives a concrete, rate-based account of how GD reaches flatter minimizers than GF through implicit regularization in a non-convex, under-determined setting.

Abstract

We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize, about $2/\textrm{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the "Edge of Stability", which induces additional regularization by delaying convergence and may have implications for training more complex models.
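The claim that GD lands on a lower-norm, flatter solution than gradient flow can be checked numerically with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the definitions $\boldsymbol{\lambda} = \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2$ and $Q_i = a_i^2 - b_i^2$, the initial point, the two step sizes, and the use of very small-step GD as a stand-in for gradient flow. The helper names `run_gd` and `summarize` are hypothetical.

```python
import numpy as np

def run_gd(a0, b0, phi, eta, steps):
    """Run plain GD on L(a, b) = 0.5 * (a @ b - phi) ** 2 and return the final (a, b)."""
    a, b = a0.copy(), b0.copy()
    for _ in range(steps):
        eps = a @ b - phi                              # residual
        a, b = a - eta * eps * b, b - eta * eps * a    # simultaneous update of both layers
    return a, b

def summarize(a, b, phi):
    """Return (residual, squared parameter norm, per-coordinate imbalance)."""
    eps = float(a @ b - phi)
    lam = float(a @ a + b @ b)   # ASSUMED definition of lambda
    q = a ** 2 - b ** 2          # ASSUMED definition of Q_i
    return eps, lam, q

# Hypothetical problem instance and initialization.
phi = 1.0
a0 = np.array([2.0, 0.5])
b0 = np.array([0.1, 0.3])
_, lam0, q0 = summarize(a0, b0, phi)

# Large step (chosen below the Theorem 1 threshold for this start) vs. a tiny
# step used as a proxy for gradient flow.
a_big, b_big = run_gd(a0, b0, phi, eta=0.4, steps=500)
a_small, b_small = run_gd(a0, b0, phi, eta=1e-3, steps=20_000)

eps_big, lam_big, q_big = summarize(a_big, b_big, phi)
eps_small, lam_small, q_small = summarize(a_small, b_small, phi)

# At a global minimum the Hessian of L is the rank-one matrix (b, a)(b, a)^T,
# whose only nonzero eigenvalue is ||a||^2 + ||b||^2, so lam is also the sharpness there.
print("large step: residual", eps_big, "norm/sharpness", lam_big, "Q", q_big)
print("small step: residual", eps_small, "norm/sharpness", lam_small, "Q", q_small)
```

Under these assumptions both runs should reach a near-zero residual, with the large-step run ending at a smaller norm and a visibly shrunken imbalance, while the small-step run leaves the imbalance almost unchanged.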

Paper Structure (39 sections, 24 theorems, 117 equations, 3 figures)

Introduction
1. Convergence of Gradient Descent.
2. Location of Convergence.
Key Implications.
Technical Overview.
Related Work
Preliminaries
Residuals.
Norm of the parameters.
The imbalance.
Location of Convergence
Quantifying the Implicit Regularization.
Speed of Convergence
Proof Sketch
Conclusion
...and 24 more sections

Key Result

Theorem 1

Let $0 < \eta < \min\Bigl\{\tfrac{1}{2|\boldsymbol{\varepsilon}(0)|}, \tfrac{2}{\sqrt{\boldsymbol{\lambda}(0)^2 + 4\,\Phi^2}}\Bigr\}$. Then, at the limit point of gradient descent we have [...] for all $i \in \{1,2, \ldots, d\}$. A similar result holds for larger stepsizes, but its statement is more involved; see Appendix app:location.
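The step-size condition in Theorem 1 is easy to evaluate for a given initialization. The sketch below is illustrative only: it assumes $\boldsymbol{\varepsilon}(0) = \mathbf{a}^\top\mathbf{b} - \Phi$ and $\boldsymbol{\lambda}(0) = \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2$ evaluated at the starting point (definitions not spelled out in this summary), and the function name `theorem1_stepsize_bound` is hypothetical.

```python
import numpy as np

def theorem1_stepsize_bound(a, b, phi):
    """Upper limit on eta from Theorem 1: min{1 / (2|eps(0)|), 2 / sqrt(lam(0)^2 + 4 phi^2)}.

    ASSUMES eps(0) = a @ b - phi and lam(0) = ||a||^2 + ||b||^2.
    """
    eps0 = float(a @ b - phi)
    lam0 = float(a @ a + b @ b)
    sharpness_term = 2.0 / np.sqrt(lam0 ** 2 + 4.0 * phi ** 2)
    if eps0 == 0.0:
        # Already at a global minimum: the residual term imposes no constraint.
        return float(sharpness_term)
    residual_term = 1.0 / (2.0 * abs(eps0))
    return float(min(residual_term, sharpness_term))

# Hypothetical initialization; either term of the minimum can be the binding one.
bound = theorem1_stepsize_bound(np.array([2.0, 0.5]), np.array([0.1, 0.3]), 1.0)
print("largest step size covered by the theorem (exclusive):", bound)
```

The theorem requires a strict inequality, so a step size used in practice should sit strictly below the returned value.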

Figures (3)

Figure 1: Case for $\mathbf{a}=a,\mathbf{b}=b \in {\mathbb{R}}$. Under gradient flow, the trajectory curves away from the origin, conserving $Q_i$. GD’s discrete step moves along the affine tangent space, shrinking $Q_i$.
Figure 2: Schematic of GD behaviors in three different regions: (A) $\boldsymbol{\varepsilon} > 0$, (B) $\boldsymbol{\varepsilon} < 0 < \mathbf{a}^\top\mathbf{b}$, (C) $\mathbf{a}^\top\mathbf{b} < 0$. See text for details.
Figure 3: Gradient descent on \ref{problem:def} with $\Phi=1$ and various step sizes and initial scales. Left: Ratio $Q(T)/Q(0)$ showing how little $Q$ changes when $\eta \,\boldsymbol{\lambda}(0)$ is small and how strongly it is reduced for large $\eta \,\boldsymbol{\lambda}(0)$. Right: The time to converge to a small residual, illustrating slower convergence in cases with stronger $Q$-regularization. The chaotic behavior appears when $\eta \geq 1/\boldsymbol{\varepsilon}$.
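The contrast drawn in the captions above, gradient flow conserving $Q_i$ while a GD step shrinks it, can be worked out in one line. This is a sketch under an assumed definition $Q_i = a_i^2 - b_i^2$ (the summary does not state it), using the gradient of $L(\mathbf{a},\mathbf{b}) = \tfrac12(\mathbf{a}^\top\mathbf{b} - \Phi)^2$ with $\boldsymbol{\varepsilon} = \mathbf{a}^\top\mathbf{b} - \Phi$; the superscript $+$ for the next iterate is notation introduced here.

```latex
% Gradient flow: \dot a_i = -\varepsilon\, b_i, \quad \dot b_i = -\varepsilon\, a_i
\frac{d}{dt} Q_i = 2 a_i \dot a_i - 2 b_i \dot b_i
                 = -2\varepsilon\, a_i b_i + 2\varepsilon\, a_i b_i = 0 .

% Gradient descent: a_i^{+} = a_i - \eta\varepsilon\, b_i, \quad b_i^{+} = b_i - \eta\varepsilon\, a_i
Q_i^{+} = (a_i - \eta\varepsilon\, b_i)^2 - (b_i - \eta\varepsilon\, a_i)^2
        = (1 - \eta^2\varepsilon^2)\,(a_i^2 - b_i^2)
        = (1 - \eta^2\varepsilon^2)\, Q_i .
```

The cross terms $\mp 2\eta\varepsilon\, a_i b_i$ cancel, so each GD step multiplies every $Q_i$ by the same factor $1 - \eta^2\varepsilon^2$. While $\eta|\boldsymbol{\varepsilon}| < 1$ this factor lies in $(0, 1]$ and $|Q_i|$ can only shrink; the factor changes sign once $\eta|\boldsymbol{\varepsilon}|$ exceeds $1$, which matches the threshold $\eta \geq 1/\boldsymbol{\varepsilon}$ mentioned in the Figure 3 caption.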

Theorems & Definitions (39)

Theorem 1
Theorem 2
Lemma 1
Lemma 2
Lemma 3
proof
Lemma 4
Lemma 5
proof
Lemma 6
...and 29 more
