Sharpness-Aware Minimization Can Hallucinate Minimizers

Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang

TL;DR

This work reveals a previously underappreciated failure mode of Sharpness-Aware Minimization (SAM): in nonconvex landscapes, the standard shifted-gradient updates can cause SAM to stall at hallucinated minimizers, i.e., points where the shifted gradient vanishes while the original gradient remains nonzero. The authors provide a geometric and dynamical framework showing the existence of such points for a nontrivial interval of the perturbation radius $\rho$, extend the results to local maximizer sets under real-analyticity, and prove local attractor properties for isolated hallucinated minimizers and manifolds inherited from minimizer sets. They verify these phenomena empirically on neural networks, and demonstrate that a short SGD warm-start before enabling SAM mitigates the issue and reduces sensitivity to $\rho$. The findings highlight a critical distinction between the surrogate SAM objective and the actual SAM dynamics, with practical implications for selecting $\rho$ and for training stability in deep learning. Overall, the work advances theoretical understanding of SAM dynamics and offers a simple, effective safeguard to improve robustness in practice.
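
The update and the stall condition described above can be written out explicitly. This is a sketch in notation assumed here, not a restatement of the paper's Definition 2.1: the step size $\eta$, the iterate index $k$, and the Euclidean norm are assumptions, while $\rho$ and the shifted point $x^+$ follow the summary and the figure captions.

$$
x_k^{+} = x_k + \rho\,\frac{\nabla f(x_k)}{\|\nabla f(x_k)\|},
\qquad
x_{k+1} = x_k - \eta\,\nabla f(x_k^{+}).
$$

Under this notation, the iteration stalls at a point $x_h$ that is not stationary for $f$ whenever

$$
\nabla f(x_h^{+}) = 0
\quad\text{while}\quad
\nabla f(x_h) \neq 0 .
$$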

Abstract

Sharpness-Aware Minimization (SAM) is widely used to seek flatter minima, which are often linked to better generalization. In its standard implementation, SAM updates the current iterate using the loss gradient evaluated at a point perturbed by distance $\rho$ along the normalized gradient direction. We show that, for some choices of $\rho$, SAM can stall at points where this shifted (perturbed-point) gradient vanishes despite a nonzero original gradient; such points are therefore not stationary points of the original loss. We call these points hallucinated minimizers, prove their existence under simple nonconvex landscape conditions (e.g., the presence of a local minimizer and a local maximizer), and establish sufficient conditions for local convergence of the SAM iterates to them. We corroborate this failure mode in neural network training and observe that it aligns with SAM's performance degradation often seen at large $\rho$. Finally, as a practical safeguard, we find that a short initial SGD warm-start before enabling SAM mitigates this failure mode and reduces sensitivity to the choice of $\rho$.
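
The stalling behavior described in the abstract can be reproduced on a one-dimensional toy problem. The sketch below is illustrative only and is not taken from the paper: the function $f(x) = x^2(x-1)^2$ (minimizers at 0 and 1, a local maximizer at 0.5), the radius `rho = 0.7`, the step size, the starting point, and all helper names are assumptions chosen for this example. For $x \in (0.5, 1)$ the gradient is negative, so the perturbed point is $x - \rho$; at $x = \rho$ the perturbation lands exactly on the minimizer at 0, where the shifted gradient vanishes.

```python
# Toy 1-D illustration (assumed example, not from the paper).

def f(x):
    # Double-well loss: minimizers at x = 0 and x = 1, local maximizer at x = 0.5.
    return x**2 * (x - 1.0) ** 2

def grad_f(x):
    # Derivative of x^4 - 2x^3 + x^2.
    return 4.0 * x**3 - 6.0 * x**2 + 2.0 * x

def sam_gradient(x, rho, eps=1e-12):
    # Gradient evaluated at the point shifted by rho along the normalized gradient.
    g = grad_f(x)
    u = g / (abs(g) + eps)  # normalized gradient (a sign in 1-D); eps avoids 0/0
    return grad_f(x + rho * u)

def run(x0, direction, lr=0.1, n_steps=500):
    # Generic descent loop: x <- x - lr * direction(x).
    x = x0
    for _ in range(n_steps):
        x = x - lr * direction(x)
    return x

rho = 0.7
x_sam = run(0.8, lambda x: sam_gradient(x, rho))  # SAM-style iteration
x_gd = run(0.8, grad_f)                           # plain gradient descent

print("SAM iterate:", x_sam, "| shifted grad:", sam_gradient(x_sam, rho),
      "| original grad:", grad_f(x_sam), "| loss:", f(x_sam))
print("GD iterate: ", x_gd, "| original grad:", grad_f(x_gd), "| loss:", f(x_gd))
```

From the same starting point, plain gradient descent reaches the minimizer at 1, whereas the SAM-style iteration settles near $x = \rho$, where the shifted gradient is numerically zero but the original gradient and the loss are not.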

Paper Structure

This paper contains 54 sections, 22 theorems, 77 equations, 14 figures, 1 table.

Key Result

Theorem 2.2

Let $f\colon\mathbb{R}^d\rightarrow\mathbb{R}$ be continuously differentiable. Assume $f$ has a global minimizer $x_\star$ (not necessarily unique) and an isolated local maximizer $x^\bullet$. Then, there exists a nontrivial interval of radii $\rho$ for which hallucinated minimizers exist, and this

Figures (14)

  • Figure 1: Illustrative example of hallucinated minimizers. See the appendix for details. (a) Smooth function $f$ with a minimizer set and an isolated maximizer. (b) $f^{\mathrm{SAM}}=f(x+\rho \, u(x))$; its minimizers (labeled as "SAM minimizers" in the plot) do not correspond to minimizers or stationary points of $f$ and are therefore hallucinated. (c) Vector field of the SAM gradient $\nabla f(x^+)$; the hallucinated minimizers are attractors of the SAM iteration.
  • Figure 2: Illustration of the proof for Theorem 2.2. The point $x_h$ is the farthest from $x_\star$ among the points in $C_{\varepsilon}$. By the method of Lagrange multipliers, its gradient $\nabla f(x_h)$ points exactly toward $x_\star$.
  • Figure 3: Visualizations of $f$ and $f^{\mathrm{SAM}}$ around the hallucinated minimizer $x_h$. Plots (a) and (b) are taken on a two-dimensional plane defined by $x_h$ and two random directions. These show that $x_h$ is not stationary for $f$, while it appears as a minimizer of $f^{\mathrm{SAM}}$ on the same plane. Plot (c) depicts $f^{\mathrm{SAM}}$ on the two-dimensional plane containing $x_h$, $x_0$, and $x_N$, where $x_0$ is a small perturbation of $x_h$ and $x_N$ is obtained after $N=1000$ SAM steps from $x_0$. The pink horizontal line segment indicates the set of hallucinated minimizers, showing that the SAM trajectory converges back to this set.
  • Figure 4: Full-batch MNIST (2-layer Tanh network): training loss and test accuracy for SAM-only (top) and GD$\rightarrow$SAM switching (bottom) across perturbation radii. Thin curves are individual seeds (80 total) and bold curves are means. At larger $\rho$, SAM-only frequently plateaus at positive loss and suffers accuracy collapse, while switching remains stable across $\rho$.
  • Figure 5: Final test accuracy for SAM-only and the switching strategy on CIFAR-100 with ResNet-18 using stochastic gradients. Each curve shows the mean (bold) and standard deviation (shaded area) over 5 seeds, evaluated at perturbation radii $\rho = 0.1, 0.2, \ldots, 0.8$. Both methods achieve peak accuracy at $\rho = 0.3$, with 77.05% for SAM-only and 77.49% for the switching strategy.
  • ...and 9 more figures

Theorems & Definitions (41)

  • Definition 2.1: Hallucinated minimizers
  • Theorem 2.2
  • proof : Sketch of proof
  • Definition 2.3: Local maximizer set
  • Theorem 2.4
  • Theorem 3.1
  • Theorem 3.2
  • Proposition A.1
  • proof
  • Theorem A.1
  • ...and 31 more