Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Chaewon Moon, Dongkuk Si, Chulhee Yun

TL;DR

The paper's theoretical analysis attributes sequential feature amplification, in which a depth-2 diagonal network trained with $\ell_2$-SAM relies on minor coordinates first and on major ones later, to the gradient normalization factor in SAM's perturbation. The result is a concrete example where infinite-time implicit-bias analyses are insufficient.

Abstract

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.
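The setting in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the depth-2 diagonal network is assumed to be parameterized as $\beta = u \odot v$ with uniform initialization $u = v = \alpha$, the $\ell_2$ perturbation is assumed to be $\rho \, g / \|g\|_2$, and the $\ell_\infty$ perturbation is assumed to be $\rho \, \mathrm{sign}(g)$. The function names `loss_grad`, `perturb`, and `train` are hypothetical. The values $\mu = (1,2)$, $\eta = 0.3$, and $\rho = 1$ are borrowed from the Figure 1 caption, while $\alpha = 0.5$ is an arbitrary choice.

```python
import numpy as np

def loss_grad(u, v, mu):
    """Gradient of log(1 + exp(-<u*v, mu>)) w.r.t. (u, v) for the example (mu, +1)."""
    margin = np.dot(u * v, mu)
    s = -1.0 / (1.0 + np.exp(margin))  # d(loss)/d(margin) = -sigmoid(-margin)
    return s * mu * v, s * mu * u

def perturb(gu, gv, rho, norm):
    """Assumed SAM ascent step: normalized gradient (l2) or sign gradient (linf)."""
    g = np.concatenate([gu, gv])
    if norm == "l2":
        eps = rho * g / (np.linalg.norm(g) + 1e-12)  # small constant avoids 0/0
    elif norm == "linf":
        eps = rho * np.sign(g)
    else:  # "gd": no perturbation, so the update reduces to plain GD
        eps = np.zeros_like(g)
    d = gu.size
    return eps[:d], eps[d:]

def train(mu, alpha, eta, rho, norm, steps):
    """Run discrete GD or SAM and return the predictor trajectory beta(t) = u * v."""
    u = np.full(mu.shape, alpha, dtype=float)  # assumed uniform initialization
    v = u.copy()
    traj = [u * v]
    for _ in range(steps):
        gu, gv = loss_grad(u, v, mu)
        eu, ev = perturb(gu, gv, rho, norm)
        gu, gv = loss_grad(u + eu, v + ev, mu)  # gradient at the perturbed point
        u, v = u - eta * gu, v - eta * gv
        traj.append(u * v)
    return np.array(traj)

if __name__ == "__main__":
    mu = np.array([1.0, 2.0])
    for name in ("gd", "l2", "linf"):
        beta = train(mu, alpha=0.5, eta=0.3, rho=1.0, norm=name, steps=200)
        print(name, beta[1], beta[-1])
```

Under these assumed settings, the first $\ell_2$-SAM step increases the minor coordinate of $\beta$ and decreases the major one, while GD grows the major coordinate faster. This toy run only illustrates the mechanism described in the abstract and does not reproduce the paper's experiments.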

Paper Structure

This paper contains 89 sections, 43 theorems, 354 equations, and 31 figures.

Key Result

Theorem 3.1

Consider the linear model $f({\bm{x}})=\langle {\bm{w}}, {\bm{x}} \rangle$ trained with logistic loss. For almost every linearly separable dataset, any perturbation radius $\rho$, and any initialization, the $\ell_\infty$-SAM flow converges in direction to the $\ell_2$ max-margin direction ...
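A toy check of the linear case can be sketched as follows. It is a sketch under assumptions, not a verification of the theorem: it uses discrete steps rather than the flow, a single example $({\bm{x}}, +1)$, and the assumed $\ell_\infty$ perturbation $\rho \, \mathrm{sign}(g)$. For a single example the $\ell_2$ max-margin direction is ${\bm{x}} / \|{\bm{x}}\|_2$. The names `linf_sam_linear` and `cosine` are hypothetical, and the numerical values are arbitrary.

```python
import numpy as np

def linf_sam_linear(w0, x, eta, rho, steps):
    """Discrete l_inf-SAM on the linear model <w, x> with logistic loss, label +1."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        g = -x / (1.0 + np.exp(np.dot(w, x)))          # gradient at w
        w_adv = w + rho * np.sign(g)                   # assumed l_inf ascent step
        g_adv = -x / (1.0 + np.exp(np.dot(w_adv, x)))  # gradient at perturbed point
        w = w - eta * g_adv
    return w

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    x = np.array([1.0, 2.0])
    w0 = np.array([0.1, -0.1])
    w = linf_sam_linear(w0, x, eta=0.3, rho=1.0, steps=5000)
    print(cosine(w0, x), cosine(w, x))
```

Every update in this toy case is a positive multiple of ${\bm{x}}$, so the component of ${\bm{w}}$ orthogonal to ${\bm{x}}$ never changes while the component along ${\bm{x}}$ keeps growing. The alignment with ${\bm{x}}$ therefore improves, though only slowly, because the margin grows roughly logarithmically in the number of steps.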

Figures (31)

  • Figure 1: Trajectories of the predictor ${\bm{\beta}}(t) \in\mathbb{R}^2$ from identical initial conditions under discrete GD, $\ell_\infty$-SAM and $\ell_2$-SAM on $\{ ({\bm{\mu}}, +1)\}$ with ${\bm{\mu}} = (1,2)$. We used $\eta=0.3$ and $\rho=1$ for SAM.
  • Figure 2: Trajectories $\beta(t)$ from identical initializations under GF and $\ell_\infty$-SAM flow with $d=2$ and ${\bm{\mu}} = (1,2)$. For SAM, $\rho=1$.
  • Figure 3: Rescaled $\ell_2$-SAM flow on $\mathcal{D}_{\bm{\mu}}$ with ${\bm{\mu}}= (4,5,6,7,8) \in \mathbb{R}^5$ and $\rho=1$.
  • Figure 4: Loss curves of GD (left) and $\ell_2$-SAM (right) on a 2-layer diagonal network in Regime 2 ($\alpha=0.35$, $\mu = (1,2,3,4,5,6)$, $\rho=0.1$). Colored regions mark the coordinate with highest growth.
  • Figure 5: Grad-CAM comparison of GD and $\ell_2$-SAM on a CNN trained on MNIST. GD focuses on dominant digit pixels, whereas $\ell_2$-SAM highlights minor background regions.
  • ...and 26 more figures

Theorems & Definitions (78)

  • Theorem 3.1
  • Theorem 3.2
  • Remark 3.3: Interpretation of the Finite-time Blow-up
  • Remark 3.4: Interpretation of Exponential Growth
  • Corollary 3.4
  • Theorem 4.1
  • Theorem 4.2
  • Lemma 4.2
  • Theorem 4.3
  • Theorem 4.4
  • ...and 68 more