Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks
Pierfrancesco Beneventano, Blake Woodworth
TL;DR
The paper analyzes gradient descent (GD) on a depth-2 linear network parameterized by m = a^T b with the univariate regression loss L(a,b) = 1/2 (a^T b - Φ)^2, and proves linear convergence to a global minimum under explicit step-size conditions. It tracks three key quantities, ε (residual), λ (norm), and Q (imbalance), and shows that GD reduces the imbalance while gradient flow (GF) conserves it, so GD converges to flatter minima than GF. The authors identify a trade-off: larger step sizes strengthen implicit regularization (flatter solutions) but can slow convergence, especially near the edge of stability, where the dynamics may linger while still reducing the loss. These results clarify how discretization and implicit bias shape optimization in simple models, with possible implications for training dynamics and generalization in more complex networks. Overall, the work gives a concrete, rate-based account of how GD reaches flatter minimizers than GF through implicit regularization in a non-convex, under-determined setting.
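A short computation illustrates why a discrete GD step shrinks the imbalance while gradient flow conserves it. This is a sketch under assumed definitions, which may differ from the paper's exact ones: residual ε = a^T b - Φ, imbalance Q = ||a||^2 - ||b||^2, and step size η. With the gradients ∇_a L = ε b and ∇_b L = ε a, one GD step gives:

```latex
\begin{aligned}
a^{+} &= a - \eta\,\varepsilon\, b, \qquad b^{+} = b - \eta\,\varepsilon\, a, \\
\|a^{+}\|^2 &= \|a\|^2 - 2\eta\,\varepsilon\, a^\top b + \eta^2\varepsilon^2 \|b\|^2, \\
\|b^{+}\|^2 &= \|b\|^2 - 2\eta\,\varepsilon\, a^\top b + \eta^2\varepsilon^2 \|a\|^2, \\
Q^{+} = \|a^{+}\|^2 - \|b^{+}\|^2 &= \left(1 - \eta^2\varepsilon^2\right) Q .
\end{aligned}
```

The cross terms cancel, so the only change in Q is second order in η. In the gradient flow limit that term vanishes and Q stays constant, whereas any GD step with 0 < η|ε| < √2 strictly reduces |Q| whenever Q ≠ 0.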
Abstract
We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize -- about $2/\textrm{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the "Edge of Stability", which induces additional regularization by delaying convergence and may have implications for training more complex models.
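The claim that GD reaches a solution with lower norm than the gradient flow solution can be checked numerically on a toy instance. The sketch below is not the authors' code. It assumes the loss L(a, b) = 1/2 (a^T b - Φ)^2, uses a very small step size as a stand-in for gradient flow, and tracks the residual a^T b - Φ, the squared norm ||a||^2 + ||b||^2, and the imbalance ||a||^2 - ||b||^2 (assumed definitions). The function name `run_gd`, the initialization, and both step sizes are illustrative choices.

```python
import numpy as np


def run_gd(a0, b0, phi, eta, steps):
    """Run plain GD on L(a, b) = 0.5 * (a @ b - phi)**2.

    Returns an array with one row (eps, lam, q) per iterate, where
    eps = a @ b - phi, lam = |a|^2 + |b|^2 and q = |a|^2 - |b|^2.
    The definitions of lam and q are assumptions made for this sketch.
    """
    a, b = np.array(a0, dtype=float), np.array(b0, dtype=float)
    rows = []
    for _ in range(steps + 1):
        eps = a @ b - phi
        rows.append((eps, a @ a + b @ b, a @ a - b @ b))
        # Simultaneous update: grad_a L = eps * b, grad_b L = eps * a.
        a, b = a - eta * eps * b, b - eta * eps * a
    return np.array(rows)


# Hypothetical, deliberately unbalanced initialization and target.
a0, b0, phi = [1.5, 0.5], [0.2, -0.1], 1.0

# Tiny step size: an Euler discretization standing in for gradient flow.
flow_like = run_gd(a0, b0, phi, eta=1e-3, steps=20_000)
# Large step size, picked by hand for this initialization.
large_step = run_gd(a0, b0, phi, eta=0.6, steps=200)

for name, hist in [("eta=1e-3", flow_like), ("eta=0.6", large_step)]:
    eps, lam, q = hist[-1]
    print(f"{name}: residual={eps:.2e}, norm={lam:.4f}, imbalance={q:.4f}")
```

In this setup both runs drive the residual to zero, but the small-step run keeps the imbalance essentially at its initial value, while the large-step run ends with a smaller imbalance and a smaller squared norm. At a global minimum the Hessian of this loss is the rank-one matrix built from the vector (b, a), so its top eigenvalue equals ||a||^2 + ||b||^2, and the lower norm there corresponds to a flatter minimum.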
