
Sharpness-Aware Minimization: General Analysis and Improved Rates

Dimitris Oikonomou, Nicolas Loizou

TL;DR

Sharpness-Aware Minimization (SAM) improves generalization by reducing the sharpness of the loss landscape. The paper unifies the two main SAM variants, normalized SAM (NSAM) and unnormalized SAM (USAM), into a single Unified SAM framework under arbitrary sampling, and replaces the common bounded-variance noise assumption with the relaxed Expected Residual (ER) condition. It then establishes convergence guarantees for Polyak-Łojasiewicz (PL) and general non-convex objectives under sampling schemes that include importance sampling, and validates the theory with experiments on image classification. The results indicate that a flexible choice of normalization and sampling within the unified update yields improved convergence rates and good practical performance on both synthetic and real tasks.

Abstract

Sharpness-Aware Minimization (SAM) has emerged as a powerful method for improving generalization in machine learning models by minimizing the sharpness of the loss landscape. However, despite its success, several important questions regarding the convergence properties of SAM in non-convex settings are still open, including the benefits of using normalization in the update rule, the dependence of the analysis on the restrictive bounded variance assumption, and the convergence guarantees under different sampling strategies. To address these questions, we provide a unified analysis of SAM and its unnormalized variant (USAM) under a single flexible update rule (Unified SAM), and we present convergence results for the new algorithm under a relaxed and more natural assumption on the stochastic noise. Our analysis provides convergence guarantees for SAM under different step size selections for non-convex problems and for functions that satisfy the Polyak-Łojasiewicz (PL) condition (a non-convex generalization of strongly convex functions). The proposed theory holds under the arbitrary sampling paradigm, which includes importance sampling as a special case, allowing us to analyze variants of SAM that were never explicitly considered in the literature. Experiments validate the theoretical findings and further demonstrate the practical effectiveness of Unified SAM in training deep neural networks for image classification tasks.
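The single flexible update rule mentioned in the abstract can be illustrated with a minimal sketch. It assumes the interpolation between the two variants takes the form $\rho_t = \rho\,(1 - \lambda + \lambda / \|g\|)$, with $\lambda=0$ giving USAM and $\lambda=1$ giving normalized SAM; this exact form is not stated in this summary, and the names `unified_sam_step`, `grad_fn`, and `eps` are hypothetical.

```python
import numpy as np

def unified_sam_step(x, grad_fn, gamma, rho, lam, eps=1e-12):
    """One step of a Unified-SAM-style update (assumed interpolation form).

    lam = 0 -> unnormalized ascent step (USAM)
    lam = 1 -> normalized ascent step (SAM)
    grad_fn should evaluate the same (mini-batch) gradient at both points.
    """
    g = grad_fn(x)
    g_norm = max(np.linalg.norm(g), eps)  # guard against division by zero
    # Assumed form of the effective ascent radius (not given in this summary).
    rho_t = rho * (1.0 - lam + lam / g_norm)
    x_perturbed = x + rho_t * g            # ascent (perturbation) step
    return x - gamma * grad_fn(x_perturbed)  # descent step at the perturbed point

# Usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x0 = np.array([3.0, 4.0])
for lam in (0.0, 0.5, 1.0):
    print(lam, unified_sam_step(x0, lambda x: x, gamma=0.1, rho=0.5, lam=lam))
```

Passing the gradient as a closure keeps the sketch agnostic to the sampling scheme: the caller decides which mini-batch (or importance-weighted sample) the closure uses.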

Paper Structure

This paper contains 22 sections, 18 theorems, 81 equations, 4 figures, 9 tables.

Key Result

Theorem 3.2

Assume that each $f_i$ is $L_i$-smooth, $f$ is $\mu$-PL, and the Expected Residual (ER) condition is satisfied. Set $L_{\max}=\max_{i\in[n]}L_i$. Then the iterates of Unified SAM with [...] satisfy: [...] where $N=\frac{L_{\max}}{\mu}\left(C\gamma+\rho(1+2\gamma L_{\max}^2\rho)\left[\lambda^2+C(1-\lambda)^2\right]\right)$.
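To get a feel for the constant $N$, the sketch below simply evaluates the formula above. The function name `neighborhood_constant` and the numeric values are illustrative assumptions, and how $N$ enters the bound is not reproduced in this summary. One thing the formula itself shows: with $C=0$ and $\lambda=0$ the constant vanishes, while for $C=0$ and $\lambda>0$ it is strictly positive, which is consistent with the Figure 1 caption (USAM reaches the exact solution, the other variants a neighborhood).

```python
def neighborhood_constant(L_max, mu, C, gamma, rho, lam):
    """Evaluate N = (L_max/mu) * (C*gamma + rho*(1 + 2*gamma*L_max^2*rho)
                                  * (lam^2 + C*(1 - lam)^2))."""
    bracket = lam**2 + C * (1.0 - lam) ** 2
    return (L_max / mu) * (C * gamma + rho * (1.0 + 2.0 * gamma * L_max**2 * rho) * bracket)

# Illustrative values only (not taken from the paper).
for lam in (0.0, 0.5, 1.0):
    n_val = neighborhood_constant(L_max=1.0, mu=1.0, C=0.0, gamma=0.1, rho=0.1, lam=lam)
    print(f"lambda={lam}: N={n_val:.4f}")
```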

Figures (4)

  • Figure 1: Deterministic Unified SAM for various values of $\lambda$ applied to the ridge regression problem. USAM ($\lambda=0$) converges to the exact solution, while the other variants ($\lambda>0$) converge to a neighborhood of the solution.
  • Figure 2: Comparison between constant and decreasing step size regimes of Unified SAM. From left to right: $\lambda=0.0, 0.5, 1.0$.
  • Figure 3: Comparison between uniform and importance sampling for Unified SAM. From left to right: $\lambda=0.0, 0.5, 1.0$.
  • Figure : Deterministic Unified SAM for various values of $\lambda$ applied to the ridge regression problem. USAM ($\lambda=0$) converges to the exact solution, while the other variants ($\lambda>0$) converge to a neighborhood of the solution.
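The behavior described in the Figure 1 caption can be reproduced qualitatively with a small synthetic experiment. This is a sketch under stated assumptions, not the paper's setup: the data, the problem size, the step size `gamma`, the radius `rho`, and the assumed interpolation $\rho_t = \rho\,(1 - \lambda + \lambda/\|\nabla f(x)\|)$ are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, reg = 100, 5, 0.1          # synthetic ridge regression (illustrative sizes)
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad(x):
    # Gradient of f(x) = (1/2n) * ||Ax - b||^2 + (reg/2) * ||x||^2
    return A.T @ (A @ x - b) / n + reg * x

# Closed-form minimizer, used only to measure the distance to the solution.
x_star = np.linalg.solve(A.T @ A / n + reg * np.eye(d), A.T @ b / n)

def run(lam, gamma=0.1, rho=0.1, iters=2000, tail=10):
    """Deterministic Unified-SAM-style iteration; returns the largest
    distance to x_star over the last `tail` iterations."""
    x = np.zeros(d)
    dists = []
    for _ in range(iters):
        g = grad(x)
        g_norm = max(np.linalg.norm(g), 1e-12)      # avoid division by zero
        rho_t = rho * (1.0 - lam + lam / g_norm)    # assumed interpolation form
        x = x - gamma * grad(x + rho_t * g)
        dists.append(np.linalg.norm(x - x_star))
    return max(dists[-tail:])

tail_dist = {lam: run(lam) for lam in (0.0, 0.5, 1.0)}
for lam, dist in tail_dist.items():
    print(f"lambda={lam}: max distance to x* over last iterations = {dist:.2e}")
```

With a constant step size, the $\lambda=0$ run should drive the distance down to numerical precision, whereas any $\lambda>0$ keeps a perturbation of non-vanishing length near the solution and therefore stalls at a small but non-zero distance.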

Theorems & Definitions (28)

  • Theorem 3.2
  • Corollary 3.3
  • Corollary 3.4: Deterministic SAM
  • Theorem 3.5
  • Proposition 3.6
  • Theorem 3.7
  • Definition A.1: $L$-smooth
  • Lemma A.2
  • Definition A.3: $\mu$-PL
  • Definition A.4: Interpolation
  • ...and 18 more