On the Convergence of Gradient Descent for Large Learning Rates
Alexandru Crăciun, Debarghya Ghoshdastidar
TL;DR
The paper studies the convergence of gradient descent with a fixed, large learning rate and identifies a sharp transition at a critical value, beyond which convergence fails from almost all initializations. Using tools from dynamical systems, it proves that for linear networks with quadratic loss, gradient descent converges only from a measure-zero set of initializations once the learning rate exceeds a threshold, and it extends this conclusion to broader losses under mild analytic conditions. The central results show that the global minima form a smooth manifold, that the gradient-descent map is non-singular, and that the Hessian spectrum along the minima makes every minimum unstable for large learning rates. Experiments on non-linear networks support the theory. The findings quantify intrinsic limitations of gradient methods with a fixed large step size in practical settings, shed light on the Edge of Stability phenomenon observed when training deep networks, and offer guidance for choosing learning rates and for future theoretical extensions.
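The link between the Hessian spectrum and the instability of minima can be sketched with the standard linearization of the gradient-descent map. The notation here ($L$ for the loss, $\theta$ for the parameters, $\eta$ for the learning rate, $G_\eta$ for the update map, $\lambda_0$ for the smallest top Hessian eigenvalue over the minima) is assumed for illustration and is not taken from the paper.

```latex
\begin{align}
  G_\eta(\theta) &= \theta - \eta \nabla L(\theta),
  &
  DG_\eta(\theta^*) &= I - \eta \nabla^2 L(\theta^*), \\
  % At a minimum the Hessian eigenvalues satisfy \lambda_i \ge 0,
  % so the eigenvalues of DG_\eta(\theta^*) are 1 - \eta \lambda_i.
  \bigl| 1 - \eta \lambda_{\max}(\theta^*) \bigr| > 1
  &\iff \eta > \frac{2}{\lambda_{\max}(\theta^*)}, \\
  % If the top eigenvalue is bounded below on the set of minima M,
  % a single learning rate destabilizes all of them at once.
  \lambda_0 := \inf_{\theta^* \in M} \lambda_{\max}(\theta^*) > 0
  &\implies \text{every } \theta^* \in M \text{ is linearly unstable for } \eta > \frac{2}{\lambda_0}.
\end{align}
```

Linear instability alone does not give the measure-zero statement; that step is where the non-singularity of the gradient-descent map enters.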
Abstract
There is a vast literature on convergence guarantees for gradient descent and derived methods. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge starting from any initialization? We provide fundamental impossibility results showing that convergence becomes impossible, regardless of the initialization, once the step size is too large. Looking at the asymptotic value of the gradient norm along the optimization trajectory, we see a sharp transition as the step size crosses a critical value. Practitioners have observed this transition, yet the mechanisms behind it remain unclear beyond heuristics. Using results from dynamical systems theory, we prove that the transition occurs for linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity of the gradient. We validate our findings through experiments with non-linear networks.
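The transition in the gradient norm can be illustrated on a toy problem. The following is a minimal sketch, not the paper's construction: it assumes a scalar two-layer linear network with loss f(a, b) = 0.5 * (a*b - 1)^2, whose minima form the curve a*b = 1. At a minimum the Hessian has eigenvalues 0 and a^2 + b^2 >= 2, so in this toy case every minimum is linearly unstable once the step size exceeds 2 / 2 = 1. All function names and the constant CRITICAL_LR are hypothetical.

```python
import math

# Toy loss (an assumption for illustration): f(a, b) = 0.5 * (a*b - 1)**2.
# Minima: a*b = 1. Hessian at a minimum has eigenvalues 0 and a**2 + b**2,
# and a**2 + b**2 >= 2*a*b = 2 on the set of minima.
CRITICAL_LR = 2.0 / 2.0  # 2 / (smallest top Hessian eigenvalue over all minima)


def grad(a, b):
    """Gradient of the toy loss with respect to (a, b)."""
    r = a * b - 1.0  # residual
    return r * b, r * a


def gd_step(a, b, lr):
    """One gradient-descent step with fixed step size lr."""
    ga, gb = grad(a, b)
    return a - lr * ga, b - lr * gb


def run_gd(a, b, lr, steps):
    """Run gradient descent and record the gradient norm after each step."""
    norms = []
    for _ in range(steps):
        a, b = gd_step(a, b, lr)
        norms.append(math.hypot(*grad(a, b)))
    return a, b, norms


if __name__ == "__main__":
    for lr in (0.2, 0.9, 1.1, 1.5):
        _, _, norms = run_gd(0.5, 0.8, lr, 2000)
        # Use the largest gradient norm over the last 100 steps as a proxy
        # for the asymptotic gradient norm along the trajectory.
        print(f"lr={lr:.1f}  tail max grad norm = {max(norms[-100:]):.3e}")
```

Sweeping `lr` across `CRITICAL_LR` and comparing the tail of `norms` gives a rough picture of the transition for this loss. The threshold of 1 is specific to the toy problem and says nothing about the critical value for other networks.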
