Adaptive Gradient Methods at the Edge of Stability
Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, Justin Gilmer
TL;DR
This paper investigates how adaptive optimizers such as Adam behave in the full-batch and large-minibatch regimes. It shows that the preconditioned sharpness $\lambda_1(P^{-1}H)$ equilibrates at a fixed stability threshold, approximately $\frac{2 + 2\beta_1}{\eta(1-\beta_1)}$ (that is, $38/\eta$ for $\beta_1 = 0.9$), a regime the authors call the Adaptive Edge of Stability (AEoS). Unlike non-adaptive methods at the Edge of Stability (EoS), adaptive methods can keep advancing into high-curvature regions by updating the preconditioner, while the preconditioned sharpness stays clamped near the threshold. Minibatch experiments and results on warmup in transformers extend these findings to practical settings. The work also reports an implicit bias of adaptive methods toward higher-curvature solutions and analyzes how hyperparameters shape stability and generalization, laying groundwork for future theory.
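As a quick check of the arithmetic in the threshold quoted above, the following minimal Python sketch evaluates $\frac{2 + 2\beta_1}{\eta(1-\beta_1)}$ directly. The function name `adam_stability_threshold` and the example step size are illustrative choices, not taken from the paper.

```python
def adam_stability_threshold(eta: float, beta1: float = 0.9) -> float:
    """Evaluate (2 + 2*beta1) / (eta * (1 - beta1)), the threshold on
    lambda_1(P^{-1} H) quoted in the TL;DR.

    The function name and argument checks are illustrative assumptions.
    """
    if eta <= 0.0:
        raise ValueError("eta must be positive")
    if not (0.0 <= beta1 < 1.0):
        raise ValueError("beta1 must lie in [0, 1)")
    return (2.0 + 2.0 * beta1) / (eta * (1.0 - beta1))


# Example step size (hypothetical); with beta1 = 0.9 the product
# threshold * eta should be close to 38, up to floating-point rounding.
eta = 1e-3
threshold = adam_stability_threshold(eta)
print(threshold * eta)
```

Under this formula, setting `beta1 = 0.0` gives $2/\eta$, which matches the familiar stability threshold for plain gradient descent.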
Abstract
Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value: the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
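To make "the maximum eigenvalue of the preconditioned Hessian" concrete, here is a minimal NumPy sketch on a toy problem. It assumes a diagonal preconditioner of the form $P = \mathrm{diag}(\sqrt{v} + \epsilon)$, where $v$ stands in for Adam's second-moment estimate. The text above does not define $P$, so this form, the function name `preconditioned_sharpness`, and all numerical values are assumptions for illustration only.

```python
import numpy as np


def preconditioned_sharpness(H: np.ndarray, v: np.ndarray, eps: float = 1e-8) -> float:
    """Largest eigenvalue of P^{-1} H for an assumed diagonal preconditioner
    P = diag(sqrt(v) + eps).

    P^{-1} H is not symmetric in general, but it has the same eigenvalues as
    the symmetric matrix P^{-1/2} H P^{-1/2}, so a symmetric eigensolver applies.
    """
    p = np.sqrt(v) + eps                 # diagonal entries of P (assumed form)
    s = 1.0 / np.sqrt(p)                 # diagonal entries of P^{-1/2}
    M = s[:, None] * H * s[None, :]      # P^{-1/2} H P^{-1/2}
    return float(np.linalg.eigvalsh(M)[-1])  # eigvalsh returns ascending order


# Toy symmetric Hessian and a hypothetical second-moment estimate.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
v = np.array([4.0, 1.0])
eta = 0.5                                # hypothetical step size

sharpness = preconditioned_sharpness(H, v)
threshold = 38.0 / eta                   # threshold from the abstract, beta1 = 0.9
print(f"preconditioned sharpness: {sharpness:.4f}")
print(f"stability threshold:      {threshold:.4f}")
print("above threshold" if sharpness > threshold else "below threshold")
```

On a real network the Hessian is never formed explicitly. An iterative method built on Hessian-vector products would replace the dense eigensolver, but the quantity compared against $38/\eta$ is the same.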
