
Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term

Yun Yue, Jiadi Jiang, Zhiling Ye, Ning Gao, Yongchao Liu, Ke Zhang

TL;DR

This work revisits Sharpness-Aware Minimization (SAM) by treating sharpness as a weighted regularization term, yielding WSAM, whose weighting is controlled by a parameter $\gamma$. It provides convergence guarantees for both convex and non-convex stochastic settings and derives a PAC-Bayes–style generalization bound. It also proposes a weight-decoupled implementation in which the sharpness term reflects only the current step. Empirically, WSAM generalizes better than, or competitively with, SAM and its variants on CIFAR, ImageNet transfer, and label-noise robustness benchmarks. Ablations show the benefit of weight decoupling and a stable hyperparameter regime around $\gamma \in [0.8,0.95]$. The combined theoretical and empirical results suggest WSAM as a practical, theoretically grounded alternative to SAM for improving generalization in deep networks.

Abstract

The generalization of Deep Neural Networks (DNNs) is known to be closely related to the flatness of minima, which led to the development of Sharpness-Aware Minimization (SAM) for seeking flatter minima and better generalization. In this paper, we revisit the loss of SAM and propose a more general method, called WSAM, by incorporating sharpness as a regularization term. We prove its generalization bound through the combination of PAC and Bayes-PAC techniques, and evaluate its performance on various public datasets. The results demonstrate that WSAM achieves improved generalization, or is at least highly competitive, compared to the vanilla optimizer, SAM and its variants. The code is available at https://github.com/intelligent-machine-learning/atorch/tree/main/atorch/optimizers.
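
To make "sharpness as a regularization term" concrete, here is a minimal sketch in plain Python. It rests on assumptions not stated in this summary: that the weighted objective has the form $L(\bm{w}) + \frac{\gamma}{1-\gamma} h(\bm{w})$, with sharpness $h(\bm{w}) = \max_{\|\bm{\epsilon}\| \leq \rho} L(\bm{w}+\bm{\epsilon}) - L(\bm{w})$, and that the inner maximum is approximated by SAM's usual single normalized gradient-ascent step. The toy quadratic loss and the names `loss_grad` and `wsam_gradient` are hypothetical and are not taken from the authors' code.

```python
import math


def loss_grad(w):
    """Gradient of a toy quadratic loss L(w) = 0.5 * sum(a_i * w_i**2)."""
    curvatures = (10.0, 1.0)  # hypothetical: one sharp and one flat direction
    return [a * x for a, x in zip(curvatures, w)]


def wsam_gradient(w, rho=0.05, gamma=0.9):
    """First-order sketch of the gradient of loss + gamma/(1-gamma) * sharpness.

    Assumes 0 <= gamma < 1 and the weighted form described in the lead-in.
    """
    g = loss_grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # SAM-style ascent step: perturb the weights along the normalized gradient.
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]
    g_adv = loss_grad(w_adv)
    # Sharpness gradient approximated by grad L(w + eps) - grad L(w).
    scale = gamma / (1.0 - gamma)
    return [gx + scale * (ga - gx) for gx, ga in zip(g, g_adv)]


if __name__ == "__main__":
    w = [1.0, 1.0]
    for gamma in (0.0, 0.5, 0.9):
        print(gamma, wsam_gradient(w, gamma=gamma))
```

Under this assumed form, $\gamma = 0$ reduces to the plain gradient, $\gamma = 0.5$ gives the perturbed-point gradient used by SAM, and larger $\gamma$ weights the sharpness term more heavily. The weight-decoupled variant mentioned in the TL;DR is not modeled here.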

Paper Structure

This paper contains 20 sections, 5 theorems, 16 equations, 3 figures, 9 tables, 4 algorithms.

Key Result

Theorem 5.1

(Convergence in convex settings) Let $\{\bm{w}_t\}$ be the sequence obtained by the paper's SGD with WSAM algorithm, with $\alpha_t = \alpha / \sqrt{t}$, $\rho_{t}\leq\rho$, and $\|\bm{g}_t\|_{\infty} \leq G_{\infty}$, $\|\tilde{\bm{g}}_t\|_{\infty} \leq G_{\infty}$ for all $t\in[T]$. Suppose $\ell_t(\bm{w})$ is convex and [...] where $C_1$ and $C_2$ are defined as follows:

Figures (3)

  • Figure 1: How the WSAM update depends on the choice of $\gamma$.
  • Figure 2: WSAM can achieve different minima by choosing different $\gamma$.
  • Figure 3: The sensitivity of WSAM's performance to the choice of $\gamma$.

Theorems & Definitions (5)

  • Theorem 5.1
  • Corollary 5.2
  • Theorem 5.3
  • Corollary 5.4
  • Theorem 5.5