Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training
Zhanpeng Zhou, Mingze Wang, Yuchen Mao, Bingrui Li, Junchi Yan
TL;DR
This work investigates why Sharpness-Aware Minimization (SAM) improves generalization and reveals that SAM can efficiently select flatter minima when applied late in training. It introduces a switching scheme between SGD and SAM and uncovers a two-phase training dynamic: an initial fast escape from the sharp SGD minimum, followed by rapid convergence to a flatter minimum within the same valley. The authors provide both linear-stability and non-local analyses (P1–P4) to justify SAM's bias toward flatter minima and demonstrate exponential convergence under standard smoothness and PL conditions. Empirically, late-phase SAM matches full SAM in test performance and sharpness across datasets, and the findings extend to Adversarial Training (AT). The results highlight the importance of late-stage optimization in shaping generalization and robustness, with practical implications for reducing compute via short late-phase SAM runs.
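For readers unfamiliar with the terms used above, the standard textbook forms of the SAM update and the Polyak-Łojasiewicz (PL) condition are sketched below. The symbols (step size $\eta$, perturbation radius $\rho$, PL constant $\mu$, smoothness constant $\beta$, minimum loss value $L^{*}$) are notation assumed here for illustration and may differ from the paper's own. The final line is the classical linear-rate bound for plain gradient descent, included only to show what "exponential convergence" looks like under these conditions; it is not the paper's statement for SAM.

```latex
% Standard SAM update with normalized ascent step (assumed notation)
w_{t+1} = w_t - \eta \, \nabla L\!\left( w_t + \rho \, \frac{\nabla L(w_t)}{\lVert \nabla L(w_t) \rVert} \right)

% PL condition with constant \mu > 0
\tfrac{1}{2} \lVert \nabla L(w) \rVert^{2} \;\ge\; \mu \left( L(w) - L^{*} \right)

% Classical consequence for gradient descent on a \beta-smooth loss with \eta \le 1/\beta
L(w_t) - L^{*} \;\le\; (1 - \eta \mu)^{t} \left( L(w_0) - L^{*} \right)
```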
Abstract
Sharpness-Aware Minimization (SAM) has substantially improved the generalization of neural networks under various settings. Despite this success, its effectiveness remains poorly understood. In this work, we discover an intriguing phenomenon in the training dynamics of SAM that sheds light on its implicit bias towards flatter minima relative to Stochastic Gradient Descent (SGD). Specifically, we find that SAM efficiently selects flatter minima late in training. Remarkably, even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training. We then examine the mechanism behind this phenomenon. Theoretically, we identify two phases in the learning dynamics after SAM is applied late in training: i) SAM first escapes the minimum found by SGD exponentially fast; and ii) it then rapidly converges to a flatter minimum within the same valley. Furthermore, we empirically investigate the role of SAM during the early training phase. We conjecture that the optimization method chosen in the late phase is more crucial in shaping the final solution's properties. Based on this viewpoint, we extend our findings from SAM to Adversarial Training.
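The switching scheme described above (SGD for most of training, SAM only for the final steps) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a toy least-squares problem in NumPy, and the function names (`make_toy_data`, `mse_loss`, `minibatch_grad`, `train_late_sam`) and all hyperparameter values are hypothetical choices made for illustration. The SAM step follows the standard two-gradient form, with the ascent and descent gradients computed on the same minibatch.

```python
import numpy as np

def make_toy_data(n=64, seed=0):
    # Hypothetical noiseless linear-regression data, for illustration only.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    return X, X @ w_true

def mse_loss(w, X, y):
    # Half mean squared error over the full dataset.
    r = X @ w - y
    return 0.5 * float(r @ r) / len(y)

def minibatch_grad(w, X, y, idx):
    # Gradient of the half mean squared error on the minibatch `idx`.
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

def train_late_sam(X, y, total_steps=200, switch_step=180,
                   lr=0.05, rho=0.05, batch_size=8, seed=0):
    """Run SGD for `switch_step` steps, then SAM for the remaining steps."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    schedule = []  # records which optimizer produced each step
    for t in range(total_steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        g = minibatch_grad(w, X, y, idx)
        if t >= switch_step:
            # SAM: move to the perturbed point w + rho * g / ||g||,
            # then use the gradient evaluated there for the descent step.
            norm = np.linalg.norm(g)
            if norm > 0.0:
                g = minibatch_grad(w + rho * g / norm, X, y, idx)
            schedule.append("sam")
        else:
            schedule.append("sgd")
        w = w - lr * g
    return w, schedule

if __name__ == "__main__":
    X, y = make_toy_data()
    w, schedule = train_late_sam(X, y)
    print("SAM steps:", schedule.count("sam"), "of", len(schedule))
    print("final loss:", mse_loss(w, X, y))
```

In this sketch, setting `switch_step=0` gives SAM for the whole run and `switch_step=total_steps` gives plain SGD, so the same function covers both baselines. The toy quadratic only demonstrates the mechanics of the switch; it does not reproduce the flatness or generalization results reported in the paper.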
