On the Convergence of Gradient Descent for Large Learning Rates

Alexandru Crăciun; Debarghya Ghoshdastidar

TL;DR

The paper studies gradient descent with a fixed, large learning rate and identifies a sharp transition at a critical value, beyond which convergence fails from almost all initializations. Using tools from dynamical systems, it proves that for linear networks with quadratic loss, gradient descent converges only from a measure-zero set of initializations once the learning rate exceeds a threshold, and it extends this conclusion to broader losses under mild analytic conditions. The central results show that the global minima form a smooth manifold, that the gradient-descent map is non-singular, and that the Hessian spectrum along the minima makes every minimum unstable for large rates; experiments on non-linear networks support the theory. The findings quantify intrinsic limitations of gradient methods with a fixed large step size in practical settings and shed light on the Edge of Stability phenomenon observed when training deep networks, providing guidelines for learning-rate choices and future theoretical extensions.

Abstract

A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge starting from any initialization? We provide fundamental impossibility results showing that convergence becomes impossible no matter the initialization if the step size gets too big. Looking at the asymptotic value of the gradient norm along the optimization trajectory, we see that there is a sharp transition as the step size crosses a critical value. This has been observed by practitioners, yet the true mechanisms through which this happens remain unclear beyond heuristics. Using results from dynamical systems theory, we provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient. We validate our findings through experiments with non-linear networks.
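
To make the critical step size concrete, the following worked computation uses the toy loss $L(x,y) = \frac{1}{2}(1-xy)^2$ from Figure 1. It is a sketch under one assumption: a minimum is treated as weakly stable when every eigenvalue of the linearized gradient-descent map $DG = I - \eta H_L$ has modulus at most 1. The paper's exact definition of $M_{WS}$ is not reproduced on this page, so this reading may differ in detail.

```latex
\begin{align*}
  \nabla L(x,y) &= -(1-xy)\begin{pmatrix} y \\ x \end{pmatrix},
  &
  H_L(x,y) &= \begin{pmatrix} y^2 & 2xy-1 \\ 2xy-1 & x^2 \end{pmatrix}. \\
  \intertext{On the set of minima $M = \{xy = 1\}$ this becomes}
  H_L &= \begin{pmatrix} y^2 & 1 \\ 1 & x^2 \end{pmatrix},
  &
  \operatorname{spec}(H_L) &= \{\,0,\; x^2+y^2\,\}, \\
  \intertext{so the linearization of $G(\theta) = \theta - \eta\nabla L(\theta)$ has eigenvalues}
  \operatorname{spec}(DG) &= \{\,1,\; 1-\eta(x^2+y^2)\,\},
  &
  |1-\eta(x^2+y^2)| \le 1 &\iff x^2+y^2 \le \tfrac{2}{\eta}.
\end{align*}
% On M we have x^2 + y^2 >= 2|xy| = 2, with equality only at (1,1) and (-1,-1).
% Hence no minimum satisfies the condition once eta > 1.
```

Under this reading, the condition is satisfied by a non-empty arc of minima at $\eta = 0.4$ and by none at $\eta = 1.1$, which agrees with the two regimes described in the caption of Figure 1.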

Paper Structure (25 sections, 24 theorems, 21 equations, 4 figures)

Introduction
Technical Contributions
Related Work
Analyzing the Loss Landscape
Optimization using Gradient Descent
Gradient Descent and Dynamical Systems
Preliminaries
Gradient Descent
Loss Landscape of Neural Networks
Main Results: Linear Networks
Geometry of the minima and the spectrum of $H_L$
Asymptotic behaviour of the Hessian on $M$
Gradient descent map is non-singular
Dynamics of gradient descent
Extension: Non-linear Networks
...and 10 more sections

Key Result

Proposition 1

If the singular values of $W^*_0$ are pairwise distinct and positive, $l|_{\mathcal{M}_r}$ has a unique global minimum. In the filling case, the minimum of $l|_{\mathcal{M}_r}$ agrees with that of $l$. In the non-filling case, the minima are different, however, if $W^*$ is the minimum of $l|_{\mathc

Figures (4)

Figure 1: Dynamics of gradient descent in two regimes for $L(x,y) = \frac{1}{2}(1-xy)^2$: (a) Trajectories for $\eta = 0.4$. $M_{WS}$ is red and iterations converge to points in $M_{WS}$; (b) Trajectory for $\eta = 1.1$. $M_{WS} = \varnothing$ and trajectories no longer converge. In both figures $R\subset\mathbb{R}^2$ is the displayed region.
Figure 2: For the loss $\frac{1}{2}(1-xy)^2$, we plot the length of the manifold of weakly stable minima $M_{WS}$ for $\eta \in [0.1, 1]$, normalised to the length of $M_{WS}$ at $\eta=0.1$, which is $37.4$ (more details in the example above). As a proxy for the size of $M_{WS}$ (since it can't be computed explicitly for more complex losses), we also plot the ratio between the number of initialisations that converge to a minimum (if this happens we say that the initialisation lies in the trapping region for that minimum) and the total number of initialisations.
Figure 3: Percentage of trajectories that converge after gradient descent training with fixed step size on MNIST. GELU and MSE were used as activation function and loss. The number in the legend specifies the network depth. Convergence does not happen for $\eta > 1$.
Figure 4: Size of the trapping region (as a percentage of trajectories that converge out of 100 total initialisations) for a range of step sizes and models trained with SGD on MNIST. The number in the legend represents the number of hidden layers.
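
The two regimes in Figure 1 and the trapping-region proxy in Figure 2 can be explored numerically. The sketch below is not the authors' code: it runs plain gradient descent on $L(x,y) = \frac{1}{2}(1-xy)^2$ from random initialisations and reports the fraction whose final gradient norm is below a tolerance. The function name `converged_fraction`, the sampling box, the step count, the divergence cutoff, and the tolerance are all illustrative assumptions.

```python
import numpy as np


def converged_fraction(eta, n_init=400, steps=2000, box=2.0, tol=1e-8, seed=0):
    """Fraction of random initialisations from which fixed-step gradient
    descent on L(x, y) = 0.5 * (1 - x*y)**2 ends with gradient norm < tol.

    All defaults are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-box, box, n_init)
    y = rng.uniform(-box, box, n_init)
    alive = np.ones(n_init, dtype=bool)  # False once a trajectory has blown up

    for _ in range(steps):
        r = 1.0 - x * y                  # residual of the toy loss
        gx, gy = -r * y, -r * x          # gradient of L
        # Update only trajectories that have not diverged; freeze the rest.
        x = np.where(alive, x - eta * gx, x)
        y = np.where(alive, y - eta * gy, y)
        alive &= (np.abs(x) < 1e6) & (np.abs(y) < 1e6)

    grad_norm = np.abs(1.0 - x * y) * np.hypot(x, y)
    return float(np.mean(alive & (grad_norm < tol)))


if __name__ == "__main__":
    for eta in (0.4, 1.1):
        print(f"eta = {eta}: converged fraction = {converged_fraction(eta):.3f}")
```

The converged fraction is only a proxy in the spirit of Figure 2: it depends on the sampling box and the number of steps, so it should be read qualitatively rather than compared with the figure's values.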

Theorems & Definitions (60)

Example 1
Definition 1: Lyapunov stability
Remark 1: Stability of fixed points from linearization
Definition 2: Stable set
Definition 3: Notation: $M, Crit(L), H_L$
Remark 2: Hessian determines linearization $DG(\theta)$
Definition 4: Filling and non-filling architectures
Proposition 1: trager2020
Remark 3: Our analysis is based on the framework from trager2020
Proposition 2: Geometry of $M$
...and 50 more
