Adaptive Gradient Methods at the Edge of Stability

Jeremy M. Cohen; Behrooz Ghorbani; Shankar Krishnan; Naman Agarwal; Sourabh Medapati; Michal Badura; Daniel Suo; David Cardoze; Zachary Nado; George E. Dahl; Justin Gilmer

TL;DR

Adaptive Gradient Methods at the Edge of Stability investigates how adaptive optimizers like Adam behave in full-batch and large minibatch regimes. It shows that the preconditioned sharpness equilibrates to a fixed stability threshold (approximately $\lambda_1(P^{-1}H) = \frac{2 + 2\beta_1}{\eta(1-\beta_1)}$, i.e., $38/\eta$ for $\beta_1=0.9$), defining the Adaptive Edge of Stability (AEoS). Unlike non-adaptive EoS, adaptive methods can progress into high-curvature regions by updating the preconditioner, though the preconditioned sharpness remains clamped near the threshold; minibatch results and warmup in transformers extend these findings to practical settings. The work also uncovers an implicit bias of adaptive methods toward higher-curvature solutions and analyzes how hyperparameters shape stability and generalization outcomes, laying groundwork for future theory.
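The threshold quoted above is a simple function of the learning rate and the momentum coefficient, so it can be checked numerically. The following is a minimal sketch; the function name `aeos_threshold` is an illustrative choice, not something taken from the paper.

```python
def aeos_threshold(eta, beta1):
    """Stability threshold (2 + 2*beta1) / ((1 - beta1) * eta) quoted in the summary.

    eta   : learning rate (step size)
    beta1 : EMA momentum coefficient, 0 <= beta1 < 1
    """
    if eta <= 0 or not (0.0 <= beta1 < 1.0):
        raise ValueError("need eta > 0 and 0 <= beta1 < 1")
    return (2.0 + 2.0 * beta1) / ((1.0 - beta1) * eta)


if __name__ == "__main__":
    # With beta1 = 0.9 the threshold is 38 / eta (up to floating-point rounding).
    for eta in (1.0, 1e-2, 1e-3):
        print(eta, aeos_threshold(eta, 0.9))
```

Setting `beta1 = 0` gives `2 / eta`, the familiar stability threshold of plain gradient descent on a quadratic, which is a useful sanity check on the formula.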

Abstract

Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value, the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
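To make "maximum eigenvalue of the preconditioned Hessian" concrete, the sketch below computes $\lambda_1(P^{-1}H)$ on a toy 2x2 example with power iteration. Everything here is an assumption for illustration: the matrix `H`, the diagonal preconditioner entries `p`, and the helper names are made up, and real experiments would need a scalable eigenvalue solver rather than dense loops. The example only shows that the preconditioned sharpness can be much smaller than the raw sharpness $\lambda_1(H)$ when the preconditioner is large along the sharp direction.

```python
import math


def matvec(M, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def top_eigenvalue(M, iters=500):
    """Power iteration; assumes the dominant eigenvalue of M is real and positive."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n  # unit-norm starting vector
    lam = 0.0
    for _ in range(iters):
        w = matvec(M, v)
        lam = sum(vi * wi for vi, wi in zip(v, w))  # v^T M v with ||v|| = 1
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return lam


def precondition(H, p):
    """Return P^{-1} H for a diagonal preconditioner P = diag(p), p > 0."""
    return [[h_ij / p_i for h_ij in row] for row, p_i in zip(H, p)]


# Toy symmetric "Hessian" and hypothetical diagonal preconditioner.
H = [[2.0, 1.0], [1.0, 8.0]]
p = [1.0, 4.0]

raw_sharpness = top_eigenvalue(H)
preconditioned_sharpness = top_eigenvalue(precondition(H, p))
print(raw_sharpness, preconditioned_sharpness)
```

For this toy pair the raw sharpness is $5 + \sqrt{10}$ while the preconditioned sharpness is $2.5$, so a fixed threshold on $\lambda_1(P^{-1}H)$ leaves room for $\lambda_1(H)$ to grow as long as the preconditioner grows with it.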

Paper Structure (42 sections, 3 theorems, 13 equations, 19 figures, 1 table)

Introduction
Notation and Background
Gradient descent and momentum
Preconditioned gradient descent
Adaptive gradient methods
Related Work
Training dynamics of non-adaptive gradient descent
Understanding adaptive gradient methods
Implicit bias of adaptive gradient methods
Full-batch adaptive optimizers train at the Adaptive Edge of Stability
Warm-up: "frozen Adam"
Other architectures
Other full-batch adaptive gradient methods
A corner case
The minibatch setting
...and 27 more sections

Key Result

Lemma 1

Consider running $\textsc{EmaNesterov}(\eta, \beta_1)$ on a quadratic objective $f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T \mathbf{A} \mathbf{x} + \mathbf{b}^T \mathbf{x} + c$ starting from any initialization. Let $(\mathbf{q}, a)$ be an eigenvector/eigenvalue pair of $\mathbf{A}$. If $a > \frac{1}{\

Figures (19)

Figure 1: Full-batch Adam trains at the Adaptive Edge of Stability (AEoS). We train a fully-connected network on CIFAR-10 using full-batch Adam with various hyperparameters $\eta, \beta_1, \beta_2$. Observe that the maximum eigenvalue of the preconditioned Hessian equilibrates at the numerical value $\frac{(2 + 2 \beta_1)}{(1 - \beta_1) \eta}$, which is drawn as a dashed horizontal line. However, in contrast to the non-adaptive EoS, the maximum eigenvalue of the raw Hessian usually keeps rising at the AEoS (see Figure \ref{fig:hessian-rises}).
Figure 2: We optimize the quadratic objective $f(x) = \frac{1}{2} x^2$ using $\textsc{EmaHB}(\eta, 0.9)$ at various $\eta$. Observe that learning rates $\eta$ above 38 diverge.
Figure 3: "Frozen Adam" (preconditioned momentum) trains at the Edge of Stability. We train a fully-connected network on CIFAR-10 using "frozen Adam," i.e. preconditioned gradient descent with EMA-style heavy ball momentum and a fixed preconditioner $\mathbf{P}$. Consistent with Cohen et al. (2021), the preconditioned sharpness $\lambda_1(\mathbf{P}^{-1} \, \mathbf{H}_t)$ rises until equilibrating at the stability threshold of $38/\eta$. The raw sharpness $\lambda_1(\mathbf{H}_t)$ mostly ceases to increase at the EoS.
Figure 4: The phenomenon generalizes to other architectures. We train three vision architectures using full-batch Adam with $\beta_1 = 0.9$ and $\beta_2= 0.999$ at a range of learning rates (colors). In each case, the preconditioned sharpness equilibrates at the threshold $38/\eta$. Each network was trained until either reaching a milestone train loss value, or until reaching a step limit.
Figure 5: Other adaptive gradient algorithms train at the AEoS. We train a FC network on CIFAR-10 using eight adaptive optimizers in full-batch mode. We train each algorithm at five learning rates (colors). In each case, the preconditioned sharpness equilibrates at, or just above, the stability threshold (written in parentheses). Note that the qualitative behavior of the preconditioned sharpness depends on the presence and type of momentum; see Appendix C.
...and 14 more figures
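The experiment described in the Figure 2 caption is small enough to reproduce in a few lines. The sketch below assumes that EMA-style heavy ball means the update $m \leftarrow \beta_1 m + (1-\beta_1) g$ followed by $x \leftarrow x - \eta m$; this form is an assumption, chosen because it yields the $38/\eta$ threshold for $\beta_1 = 0.9$, and the function name, step count, and starting point are illustrative.

```python
def run_ema_heavy_ball(eta, beta1=0.9, curvature=1.0, x0=1.0, steps=500):
    """Run EMA-style heavy ball on f(x) = 0.5 * curvature * x**2.

    Assumed update (not confirmed by this page):
        m <- beta1 * m + (1 - beta1) * grad
        x <- x - eta * m
    Returns the final iterate.
    """
    x, m = x0, 0.0
    for _ in range(steps):
        grad = curvature * x  # gradient of the quadratic
        m = beta1 * m + (1.0 - beta1) * grad
        x = x - eta * m
    return x


if __name__ == "__main__":
    # Curvature 1 and beta1 = 0.9 put the stability threshold at eta = 38.
    for eta in (30.0, 37.0, 39.0, 45.0):
        print(eta, abs(run_ema_heavy_ball(eta)))
```

Under this update the iterates obey a two-term linear recurrence, $x_{t+1} = (1 + \beta_1 - \eta(1-\beta_1)a)\,x_t - \beta_1 x_{t-1}$ for curvature $a$, which is stable exactly when $a < \frac{2 + 2\beta_1}{(1-\beta_1)\eta}$. Learning rates just below 38 therefore shrink toward zero and those just above blow up.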

Theorems & Definitions (6)

Lemma 1
proof
Lemma 2
proof
Proposition 1
proof
