Sharpness-Aware Minimization and the Edge of Stability

Philip M. Long; Peter L. Bartlett

TL;DR

Sharpness-Aware Minimization (SAM) modifies gradient-based training by evaluating the gradient at a perturbed point, a distance $\rho$ from the current iterate in the gradient direction, to favor flatter minima. The paper derives a SAM-edge of stability, a threshold on the operator norm of the Hessian that depends on the gradient norm $\|g\|$, the learning rate $\eta$, and the neighborhood radius $\rho$, and shows that this edge lies below the conventional GD edge $2/\eta$. Empirically, SAM tracks this edge on MNIST, CIFAR10, and tiny_shakespeare; the edge shrinks as the gradient norm diminishes, and SAM often reaches flatter minima with comparable or improved training loss. These findings connect sharpness-aware updates to dynamic stability and generalization, suggesting that SAM's lower stability threshold steers training toward smoother regions of the loss landscape.

Abstract

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
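
The quadratic-approximation argument can be made concrete with a short sketch. It assumes the standard normalized SAM update $w_{t+1} = w_t - \eta \nabla \ell(w_t + \rho g/\|g\|)$ with $g = \nabla \ell(w_t)$, and a quadratic loss whose Hessian has top eigenvalue $\lambda$. Under those assumptions, one step multiplies the component of $w_t$ along the top eigenvector by $1 - \eta\lambda - \eta\rho\lambda^2/\|g\|$, and setting that multiplier to $-1$ gives a threshold on $\lambda$. The function names and numeric values below are illustrative and are not taken from the paper.

```python
import math


def gd_edge(eta):
    """Edge of stability for plain GD on a quadratic: 2 / eta."""
    return 2.0 / eta


def sam_multiplier(lam, g_norm, eta, rho):
    """Factor applied to the top-eigenvector component by one SAM step.

    Assumes a quadratic loss and the normalized SAM perturbation
    rho * g / ||g||; g_norm is the norm of the full gradient.
    """
    return 1.0 - eta * lam - eta * rho * lam ** 2 / g_norm


def sam_edge(g_norm, eta, rho):
    """Largest lam with |sam_multiplier| <= 1.

    This is the positive root of (rho / g_norm) * lam^2 + lam - 2 / eta = 0.
    """
    return g_norm / (2.0 * rho) * (math.sqrt(1.0 + 8.0 * rho / (eta * g_norm)) - 1.0)


# Illustrative values (assumptions, not the paper's settings).
eta, rho, g_norm = 0.1, 0.1, 1.0
edge = sam_edge(g_norm, eta, rho)
print("GD edge: ", gd_edge(eta))
print("SAM edge:", edge)
print("multiplier just below the edge:", sam_multiplier(0.9 * edge, g_norm, eta, rho))
print("multiplier just above the edge:", sam_multiplier(1.1 * edge, g_norm, eta, rho))
```

With these values the threshold comes out at roughly half of $2/\eta$, and the multiplier's magnitude crosses 1 exactly at the threshold. Because the threshold depends on $\|g\|$, it moves as training progresses, which is the qualitative difference from GD that the abstract points to.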

Paper Structure (13 sections, 3 theorems, 12 equations, 16 figures)

Introduction
Derivation
Methods
Settings
Hyperparameters
Implementation
Unreported preliminary experiments
Results
MNIST
CIFAR10
Language modeling
Related work
Conclusion

Key Result

Proposition 1

For $w_t \in {\mathbb R}^d$, $\eta > 0$, if then

Figures (16)

Figure 1: The ratio of SAM's edge of stability to $2/\eta$, the edge of stability for GD, as a function of $\alpha=\eta\|g\|/(2\rho)$.
Figure 2: Magnitudes of the largest eigenvalues of the Hessian when an MLP is trained with GD on MNIST.
Figure 3: Magnitudes of the largest eigenvalues of the Hessian when an MLP is trained with SAM on MNIST, with $\rho=0.1$.
Figure 4: Magnitudes of the largest eigenvalues of the Hessian when an MLP is trained with SAM on MNIST, with $\rho=0.1$.
Figure 5: Training loss with GD and SAM on MNIST.
...and 11 more figures
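
The curve described by the Figure 1 caption can be tabulated with a few lines of Python. The closed form used here is an assumption: it comes from dividing a SAM-edge of the form $\frac{\|g\|}{2\rho}\left(\sqrt{1 + \frac{8\rho}{\eta\|g\|}} - 1\right)$, obtained from a local quadratic approximation with the normalized SAM perturbation, by $2/\eta$ and substituting $\alpha = \eta\|g\|/(2\rho)$. `edge_ratio` is a hypothetical helper name.

```python
import math


def edge_ratio(alpha):
    """Ratio of the (assumed) SAM-edge to the GD edge 2/eta.

    alpha = eta * ||g|| / (2 * rho); must be positive.
    """
    return 0.5 * alpha * (math.sqrt(1.0 + 4.0 / alpha) - 1.0)


# Small alpha (small gradient relative to rho) pushes the ratio toward 0;
# large alpha pushes it toward 1, recovering the GD edge.
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(f"alpha={alpha:g}  ratio={edge_ratio(alpha):.3f}")
```

Under this form the ratio stays strictly between 0 and 1 and increases with $\alpha$, so the gap between the SAM-edge and $2/\eta$ widens as the gradient norm shrinks.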

Theorems & Definitions (5)

Proposition 1
proof
Proposition 2
Proposition 3
proof
