Preconditioned Sharpness-Aware Minimization: Unifying Analysis and a Novel Learning Algorithm

Yilang Zhang, Bingcong Li, Georgios B. Giannakis

TL;DR

The paper addresses generalization in deep learning by viewing sharpness-aware minimization (SAM) through a preconditioning lens. It introduces preSAM, which groups SAM variants into two classes, constraint preconditioning (CP) and objective preconditioning (OP), and backs the framework with a unified convergence analysis that guides design choices. Building on this, infoSAM is proposed to counter adversarial model degradation (AMD) caused by gradient noise by weighting gradient components according to their estimated variance. Experiments on CIFAR-10/100 and ImageNet, including label-noise scenarios, show infoSAM consistently improving generalization over SAM, ASAM, and SGD.

Abstract

Targeting solutions over "flat" regions of the loss landscape, sharpness-aware minimization (SAM) has emerged as a powerful tool to improve generalizability of deep neural network-based learning. While several SAM variants have been developed to this end, a unifying approach that also guides principled algorithm design has been elusive. This contribution leverages preconditioning (pre) to unify SAM variants and provide not only unifying convergence analysis, but also valuable insights. Building upon preSAM, a novel algorithm termed infoSAM is introduced to address the so-called adversarial model degradation issue in SAM by adjusting gradients depending on noise estimates. Extensive numerical tests demonstrate the superiority of infoSAM across various benchmarks.
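
For readers unfamiliar with the baseline that preSAM and infoSAM build on, the following is a minimal sketch of one vanilla SAM update: move to an "adversarial model" along the normalized gradient, then descend using the gradient evaluated there. It is not the paper's preSAM or infoSAM; the function name `sam_step`, the toy quadratic loss, and the values of `lr` and `rho` are assumptions chosen only for illustration.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05, eps=1e-12):
    """One vanilla SAM update (illustrative sketch, not preSAM/infoSAM)."""
    g = grad_fn(w)
    # Adversarial perturbation: gradient direction scaled to radius rho
    # (first-order approximation of the inner maximization).
    e = rho * g / (np.linalg.norm(g) + eps)
    # Descend with the gradient evaluated at the adversarial model w + e.
    return w - lr * grad_fn(w + e)

# Hypothetical toy loss f(w) = 0.5 * w^T A w with one sharp and one flat direction.
A = np.diag([10.0, 1.0])
loss = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w0 = np.array([1.0, 1.0])
w = w0.copy()
for _ in range(100):
    w = sam_step(w, grad)
```

In stochastic training, `grad_fn` would return a minibatch gradient, so the perturbation `e` itself becomes noisy; this is the setting in which the abstract's adversarial model degradation issue arises.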

Paper Structure

This paper contains 23 sections, 3 theorems, 27 equations, 4 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Suppose Assumptions 1-3 hold. Let $\eta_t \equiv \eta = \frac{\eta_0}{\sqrt{T}} \le \frac{2}{3L}$ and $\rho = \frac{\rho_0}{\sqrt{T}}$. In addition, suppose $\| \mathbf{D}_t^{-1} \| \le D_0, \forall t$. Then, preSAM in Algorithm 1 guarantees that
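
The hyperparameter conditions stated in Theorem 1 can be sketched as a small helper that computes the constant step size $\eta$ and perturbation radius $\rho$ for a horizon of $T$ iterations and checks the bound $\eta \le \frac{2}{3L}$. The function name `theorem1_schedule` and the example values are hypothetical, and reading $L$ as a smoothness constant is an assumption, since the assumptions themselves are not reproduced here.

```python
import math

def theorem1_schedule(eta0, rho0, T, L):
    """Return (eta, rho) with eta = eta0/sqrt(T) and rho = rho0/sqrt(T).

    Raises ValueError if the step-size condition eta <= 2/(3L) is violated.
    L is assumed here to be a smoothness constant of the loss.
    """
    eta = eta0 / math.sqrt(T)
    rho = rho0 / math.sqrt(T)
    if eta > 2.0 / (3.0 * L):
        raise ValueError("eta exceeds 2/(3L); increase T or decrease eta0")
    return eta, rho

# Hypothetical values, for illustration only.
eta, rho = theorem1_schedule(eta0=1.0, rho0=0.5, T=10_000, L=5.0)
```

Both quantities shrink at the same $1/\sqrt{T}$ rate, so a longer training horizon implies a smaller step size and a smaller perturbation radius.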

Figures (4)

  • Figure 1: (a) Top-1 and (b) top-5 accuracies on ImageNet.
  • Figure 2: Performance under different levels of label noise.
  • Figure 3: Behavior of SGD (left), ideal SAM (middle), and SAM with stochastic noise (right) near an asymmetric valley. First row: transition from a sharper slope to a flatter one; second row: minimization along a flatter slope. Comparing the middle column with the left shows why SAM helps find a solution on the flatter slope, which generalizes better. The right column shows why gradient noise causes adversarial model degradation (AMD).
  • Figure 4: Comparison of the adversarial models in (a) SAM and (b) infoSAM.

Theorems & Definitions (6)

  • Theorem 1: Unified convergence
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • proof