Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Chaewon Moon; Dongkuk Si; Chulhee Yun

TL;DR

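On depth-2 linear diagonal networks, $\ell_2$-SAM exhibits "sequential feature amplification": the predictor initially relies on minor coordinates and gradually shifts to larger ones. The theoretical analysis attributes this phenomenon to the gradient normalization factor in $\ell_2$-SAM's perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient.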

Abstract

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.
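The two SAM variants named in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: it assumes the depth-2 diagonal predictor is parameterized as $\bm{\beta} = \bm{u} \odot \bm{v}$, uses the standard SAM ascent step ($\rho\, g / \|g\|_2$ for $\ell_2$, $\rho\, \mathrm{sign}(g)$ for $\ell_\infty$), and picks the step size and radius arbitrarily. All function names are hypothetical.

```python
import numpy as np

def grad_beta(beta, x, y):
    """Gradient of the logistic loss log(1 + exp(-y <beta, x>)) w.r.t. beta."""
    margin = y * np.dot(beta, x)
    return -y * x / (1.0 + np.exp(margin))

def grad_uv(u, v, x, y):
    """Chain rule for the assumed depth-2 parameterization beta = u * v."""
    g = grad_beta(u * v, x, y)
    return g * v, g * u

def sam_perturbation(gu, gv, rho, norm="l2"):
    """Ascent direction maximizing <eps, grad> over the chosen norm ball."""
    if norm == "l2":
        # The division by the gradient norm is the normalization factor.
        scale = np.sqrt(np.sum(gu**2) + np.sum(gv**2)) + 1e-12
        return rho * gu / scale, rho * gv / scale
    if norm == "linf":
        return rho * np.sign(gu), rho * np.sign(gv)
    raise ValueError("norm must be 'l2' or 'linf'")

def sam_step(u, v, x, y, lr, rho, norm="l2"):
    """One discrete SAM update: descend using the gradient at the perturbed point."""
    gu, gv = grad_uv(u, v, x, y)
    eu, ev = sam_perturbation(gu, gv, rho, norm)
    gu_adv, gv_adv = grad_uv(u + eu, v + ev, x, y)
    return u - lr * gu_adv, v - lr * gv_adv

# Single-example dataset {(mu, +1)} with mu = (1, 2); lr and rho are illustrative.
x = np.array([1.0, 2.0])
y = 1.0
u0 = np.full(2, 0.5)
v0 = np.full(2, 0.5)
u1, v1 = sam_step(u0, v0, x, y, lr=0.1, rho=0.1, norm="l2")
print("predictor after one l2-SAM step:", u1 * v1)
```

Setting `rho=0` reduces `sam_step` to a plain gradient-descent step, which makes it easy to compare the two trajectories from the same initialization.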

Paper Structure (89 sections, 43 theorems, 354 equations, 31 figures)

Introduction
Summary of Our Contributions
Related Work
Preliminaries
SAM with $\ell_\infty$-Perturbations
Depth-1 Networks
Deeper Networks ($L \geq 2$)
SAM with $\ell_2$-Perturbations: Sequential Feature Amplification
Asymptotic Behavior on Depth-1 and Depth-2 Networks
Pre-asymptotic Behavior on Depth-2 Networks
Sequential Feature Amplification
Understanding the Effect of $\ell_2$-SAM
Analysis of Time-wise Sequential Feature Amplification
Analysis of Initialization-wise Sequential Feature Amplification
Experiments
...and 74 more sections

Key Result

Theorem 3.1

For almost every linearly separable dataset, any perturbation radius $\rho$, and any initialization, consider the linear model $f({\bm{x}})=\langle {\bm{w}}, {\bm{x}} \rangle$ trained with logistic loss. Then, $\ell_\infty$-SAM flow directionally converges in the $\ell_2$ max-margin direction…
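A quick numerical check can make the depth-1 statement concrete. The sketch below is an assumption-laden illustration rather than the paper's experiment: it replaces the continuous-time flow with discrete steps, uses a single-example dataset $\{({\bm{x}}, +1)\}$ (for which the $\ell_2$ max-margin direction is ${\bm{x}}/\|{\bm{x}}\|_2$), and chooses the step size, radius, and initialization arbitrarily. The function name is hypothetical.

```python
import numpy as np

def linf_sam_linear(x, y, w0, lr=0.5, rho=0.1, steps=5000):
    """Discrete l_inf-SAM on the linear model f(x) = <w, x> with logistic loss."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        # Gradient of log(1 + exp(-y <w, x>)) at the current iterate.
        g = -y * x / (1.0 + np.exp(y * np.dot(w, x)))
        # l_inf ascent step: move rho along the sign of the gradient.
        w_adv = w + rho * np.sign(g)
        # Descend using the gradient evaluated at the perturbed point.
        g_adv = -y * x / (1.0 + np.exp(y * np.dot(w_adv, x)))
        w = w - lr * g_adv
    return w

x = np.array([1.0, 2.0])
y = 1.0
# Initialization orthogonal to x, so alignment is not trivial at step 0.
w = linf_sam_linear(x, y, w0=[0.2, -0.1])
cosine = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
print("cosine with the l2 max-margin direction:", cosine)
```

Because the norm of `w` grows only logarithmically in the number of steps, the cosine approaches 1 slowly; more steps tighten the alignment.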

Figures (31)

Figure 1: Trajectories of the predictor ${\bm{\beta}}(t) \in\mathbb{R}^2$ from identical initial conditions under discrete GD, $\ell_\infty$-SAM and $\ell_2$-SAM on $\{ ({\bm{\mu}}, +1)\}$ with ${\bm{\mu}} = (1,2)$. We used $\eta=0.3$ and $\rho=1$ for SAM.
Figure 2: Trajectories ${\bm{\beta}}(t)$ from identical initializations under GF and $\ell_\infty$-SAM flow with $d=2$ and ${\bm{\mu}} = (1,2)$. For SAM, $\rho=1$.
Figure 3: Rescaled $\ell_2$-SAM flow on $\mathcal{D}_{\bm{\mu}}$ with ${\bm{\mu}}= (4,5,6,7,8) \in \mathbb{R}^5$ and $\rho=1$.
Figure 4: Loss curves of GD (left) and $\ell_2$-SAM (right) on a 2-layer diagonal network in Regime 2 ($\alpha=0.35$, ${\bm{\mu}} = (1,2,3,4,5,6)$, $\rho=0.1$). Colored regions mark the coordinate with highest growth.
Figure 5: Grad-CAM comparison of GD and $\ell_2$-SAM on a CNN trained on MNIST. GD focuses on dominant digit pixels, whereas $\ell_2$-SAM highlights minor background regions.
...and 26 more figures

Theorems & Definitions (78)

Theorem 3.1
Theorem 3.2
Remark 3.3: Interpretation of the Finite-time Blow-up
Remark 3.4: Interpretation of Exponential Growth
Corollary 3.4
Theorem 4.1
Theorem 4.2
Lemma 4.2
Theorem 4.3
Theorem 4.4
...and 68 more
