Sharpness-Aware Minimization Can Hallucinate Minimizers
Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang
TL;DR
This work identifies an underappreciated failure mode of Sharpness-Aware Minimization (SAM): in nonconvex landscapes, the standard shifted-gradient update can stall at hallucinated minimizers, i.e., points where the shifted gradient vanishes while the original gradient remains nonzero. The authors develop a geometric and dynamical framework showing that such points exist for a nontrivial interval of the perturbation radius $\rho$, extend the results to local maximizer sets under real-analyticity, and prove local attractor properties for isolated hallucinated minimizers and for manifolds inherited from minimizer sets. They verify these phenomena empirically on neural networks and demonstrate that a short SGD warm-start before enabling SAM mitigates the issue and reduces sensitivity to $\rho$. The findings draw a sharp distinction between the surrogate SAM objective and the actual SAM dynamics, with practical implications for selecting $\rho$ and for training stability in deep learning.
Abstract
Sharpness-Aware Minimization (SAM) is widely used to seek flatter minima, which are often linked to better generalization. In its standard implementation, SAM updates the current iterate using the loss gradient evaluated at a point perturbed by distance $\rho$ along the normalized gradient direction. We show that, for some choices of $\rho$, SAM can stall at points where this shifted (perturbed-point) gradient vanishes even though the original gradient is nonzero, so these points are not stationary points of the original loss. We call these points hallucinated minimizers, prove their existence under simple nonconvex landscape conditions (e.g., the presence of a local minimizer and a local maximizer), and establish sufficient conditions for local convergence of the SAM iterates to them. We corroborate this failure mode in neural network training and observe that it aligns with the performance degradation of SAM often seen at large $\rho$. Finally, as a practical safeguard, we find that a short initial SGD warm-start before enabling SAM mitigates this failure mode and reduces sensitivity to the choice of $\rho$.
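The stalling behavior described in the abstract can be illustrated with a minimal one-dimensional sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy loss $f(x) = \cos(x)$, the step size, the starting point, the value $\rho = 4$, and the helper names `grad`, `sam_step`, `run_sam`, and `run_gd`. In one dimension the normalized gradient is just the sign of $f'(x)$, so the SAM update evaluates $f'$ at $x + \rho\,\mathrm{sign}(f'(x))$. For this toy loss and $\rho \in (\pi, 2\pi)$, the point $x^\star = 3\pi - \rho$ has a vanishing shifted gradient (the shifted point lands on the minimizer $3\pi$, past the maximizer at $2\pi$) while $f'(x^\star) = -\sin(\rho) \neq 0$.

```python
import math

def grad(x):
    # Toy loss f(x) = cos(x) (an assumption for illustration), so f'(x) = -sin(x).
    return -math.sin(x)

def sam_step(x, rho, lr):
    # 1D SAM update: evaluate the gradient at the point shifted by rho
    # along the normalized gradient, which in 1D is sign(f'(x)).
    g = grad(x)
    if g == 0.0:
        return x  # the normalized direction is undefined at a stationary point
    x_shift = x + rho * math.copysign(1.0, g)
    return x - lr * grad(x_shift)

def run_sam(x0, rho, lr=0.1, steps=500):
    x = x0
    for _ in range(steps):
        x = sam_step(x, rho, lr)
    return x

def run_gd(x0, lr=0.1, steps=500):
    # Plain gradient descent from the same start, for comparison.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

rho = 4.0                      # hypothetical radius inside (pi, 2*pi)
x_sam = run_sam(5.0, rho)      # SAM iterate after 500 steps
x_gd = run_gd(5.0)             # gradient descent iterate after 500 steps

print("SAM iterate:", x_sam, "predicted stall point:", 3 * math.pi - rho)
print("original gradient at SAM iterate:", grad(x_sam))
print("shifted gradient at SAM iterate:", grad(x_sam + rho))
print("GD iterate:", x_gd, "true minimizer:", math.pi)
```

In this sketch, gradient descent started at $x_0 = 5$ reaches the true minimizer $\pi$, whereas SAM started from the same point settles where only the shifted gradient is zero. The example is meant only to make the definition concrete; it says nothing about how often such points arise in neural network training.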
