Sharpness-Aware Minimization and the Edge of Stability
Philip M. Long, Peter L. Bartlett
TL;DR
Sharpness-Aware Minimization (SAM) modifies gradient-based training by evaluating the gradient at a perturbed point within a neighborhood of radius $\rho$ around the current iterate, which favors flatter minima. The paper derives a SAM-edge of stability: a threshold on the operator norm of the Hessian that depends on the gradient norm $\|g\|$, the learning rate $\eta$, and the neighborhood radius $\rho$. It shows that this edge lies below the conventional GD edge $2/\eta$. Empirically, SAM tracks this edge on MNIST, CIFAR10, and tiny_shakespeare. The edge shrinks as the gradient norm diminishes, and SAM often reaches flatter minima with comparable or improved training loss. These findings connect sharpness-aware updates to dynamic stability and generalization, suggesting that SAM broadens the practical regime in which stable, smooth optimization can occur.
Abstract
Recent experiments have shown that, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss often grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD that has been shown to improve generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
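To make the quadratic-approximation argument concrete, the following is a minimal sketch, not the authors' code. It assumes a one-dimensional quadratic loss $\ell(w) = \lambda w^2 / 2$ and the standard SAM step $w \leftarrow w - \eta \nabla \ell(w + \rho g / \|g\|)$ with $g = \nabla \ell(w)$. Under these assumptions each step multiplies $w$ by $1 - \eta\lambda - \eta\rho\lambda^2/\|g\|$, so the iterate grows in magnitude once $\eta\lambda + \eta\rho\lambda^2/\|g\| > 2$. Solving for $\lambda$ gives the threshold computed by `sam_edge`. The closed form is derived here from this toy model rather than quoted from the text above, and the function names and numerical values are illustrative.

```python
import math


def gd_edge(eta):
    """GD edge of stability for a quadratic loss: curvature threshold 2 / eta."""
    return 2.0 / eta


def sam_edge(eta, rho, grad_norm):
    """Curvature threshold for SAM on a 1D quadratic (assumption of this sketch).

    It is the positive root of (rho / grad_norm) * lam**2 + lam - 2 / eta = 0.
    """
    r = rho / grad_norm
    return (math.sqrt(1.0 + 8.0 * r / eta) - 1.0) / (2.0 * r)


def sam_step(w, lam, eta, rho):
    """One SAM step on the loss 0.5 * lam * w**2 (requires w != 0)."""
    g = lam * w
    # Ascent step of length rho along the normalized gradient.
    w_perturbed = w + rho * math.copysign(1.0, g)
    # Descent step using the gradient evaluated at the perturbed point.
    return w - eta * lam * w_perturbed


if __name__ == "__main__":
    eta, rho, w = 0.1, 0.05, 1.0
    for lam in (15.0, 19.5):
        grad_norm = abs(lam * w)
        gd_factor = abs(1.0 - eta * lam)
        sam_factor = abs(sam_step(w, lam, eta, rho) / w)
        print(
            f"lam={lam}: SAM-edge={sam_edge(eta, rho, grad_norm):.3f}, "
            f"GD-edge={gd_edge(eta):.1f}, "
            f"|GD factor|={gd_factor:.4f}, |SAM factor|={sam_factor:.4f}"
        )
```

In this toy model the threshold returned by `sam_edge` is always below $2/\eta$, approaches $2/\eta$ as the gradient norm grows, and tends to zero as the gradient norm vanishes, consistent with the dependence on $\|g\|$ described above. A curvature between the two thresholds is contracting under GD but expanding under SAM.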
