Implicit Regularization of Sharpness-Aware Minimization for Scale-Invariant Problems

Bingcong Li, Liang Zhang, Niao He

TL;DR

This work develops a resource-efficient SAM variant, balancedness-aware regularization (BAR), tailored for scale-invariant problems such as finetuning language models with LoRA, and reveals that i) SAM promotes balancedness; and ii) the regularization on balancedness is data-responsive, with outliers having a stronger impact.

Abstract

Sharpness-aware minimization (SAM) improves generalization across various deep learning tasks. Motivated by popular architectures such as LoRA, we explore the implicit regularization of SAM for scale-invariant problems involving two groups of variables. Instead of focusing on the commonly used notion of sharpness, this work introduces a concept termed balancedness, defined as the difference between the squared norms of the two variables. This allows us to depict richer global behaviors of SAM. In particular, our theoretical and empirical findings reveal that i) SAM promotes balancedness; and ii) the regularization on balancedness is data-responsive, with outliers having a stronger impact. The latter coincides with empirical observations that SAM outperforms SGD in the presence of outliers. Leveraging this implicit regularization, we develop a resource-efficient SAM variant, balancedness-aware regularization (BAR), tailored for scale-invariant problems such as finetuning language models with LoRA. BAR saves 95% of the computational overhead of SAM, with enhanced test performance across various tasks on RoBERTa, GPT2, and OPT-1.3B.
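To make "scale-invariant" and "balancedness" concrete, the following is a minimal NumPy sketch, not code from the paper. It assumes a noise-free rank-one loss $\| \mathbf{x}\mathbf{y}^\top - \mathbf{A} \|^2$ in the spirit of the NOP example; the dimensions, the random seed, the scale factor `c`, and the helper names `nop_loss` and `balancedness_gap` are illustrative assumptions. Rescaling $(\mathbf{x}, \mathbf{y}) \mapsto (c\mathbf{x}, \mathbf{y}/c)$ leaves the loss unchanged but changes the difference of squared norms, which is why balancedness carries information that the loss value alone does not.

```python
import numpy as np


def nop_loss(x, y, A):
    """Squared Frobenius error ||x y^T - A||^2 (illustrative, noise-free)."""
    return float(np.sum((np.outer(x, y) - A) ** 2))


def balancedness_gap(x, y):
    """Difference of squared norms, ||x||^2 - ||y||^2."""
    return float(x @ x - y @ y)


# Hypothetical problem sizes and data, chosen only for illustration.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = rng.standard_normal(3)
A = rng.standard_normal((4, 3))

# Rescale the two variable groups in opposite directions.
c = 3.0
x_scaled, y_scaled = c * x, y / c

loss_before = nop_loss(x, y, A)
loss_after = nop_loss(x_scaled, y_scaled, A)
gap_before = balancedness_gap(x, y)
gap_after = balancedness_gap(x_scaled, y_scaled)

print("loss unchanged:", np.isclose(loss_before, loss_after))
print("gap before / after rescaling:", gap_before, gap_after)
```

The product $\mathbf{x}\mathbf{y}^\top$ is identical before and after rescaling, so every loss that depends on the variables only through this product shares the same invariance.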

Paper Structure

This paper contains 41 sections, 12 theorems, 64 equations, 5 figures, 11 tables, 1 algorithm.

Key Result

Theorem 1

When applying SGD to the NOP (eq.prob-nop), the limiting flow as $\eta \rightarrow 0$ satisfies $\|\mathbf{x}_t\|^2 - \|\mathbf{y}_t\|^2 = \|\mathbf{x}_0\|^2 - \|\mathbf{y}_0\|^2$ for all $t > 0$. In other words, the balancedness ${\cal B}_t$ is conserved along the flow: $\frac{\mathrm{d} {\cal B}_t}{\mathrm{d} t} = 0$.
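The conservation in Theorem 1 can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's code: it uses the deterministic loss $\| \mathbf{x}\mathbf{y}^\top - \mathbf{A} \|^2$ in place of a stochastic one, and the helper `nop_grads`, the dimensions, and the step size `eta` are hypothetical. Because the loss depends on the variables only through $\mathbf{x}\mathbf{y}^\top$, the inner products $\mathbf{x}^\top \nabla_{\mathbf{x}} f$ and $\mathbf{y}^\top \nabla_{\mathbf{y}} f$ coincide, so a single gradient step changes $\|\mathbf{x}\|^2 - \|\mathbf{y}\|^2$ only at order $\eta^2$, and that term vanishes in the limiting flow.

```python
import numpy as np


def nop_grads(x, y, A):
    """Gradients of ||x y^T - A||^2 with respect to x and y."""
    R = np.outer(x, y) - A          # residual matrix
    gx = 2.0 * R @ y                # gradient w.r.t. x
    gy = 2.0 * R.T @ x              # gradient w.r.t. y
    return gx, gy


# Hypothetical sizes and data for illustration only.
rng = np.random.default_rng(1)
x = rng.standard_normal(5)
y = rng.standard_normal(4)
A = rng.standard_normal((5, 4))

gx, gy = nop_grads(x, y, A)

# First-order terms cancel: x^T gx equals y^T gy.
inner_x = x @ gx
inner_y = y @ gy

# One plain gradient step with a small step size.
eta = 1e-3
x_new, y_new = x - eta * gx, y - eta * gy

gap_old = x @ x - y @ y
gap_new = x_new @ x_new - y_new @ y_new
drift = gap_new - gap_old

# The only surviving change is the second-order term.
second_order = eta**2 * (gx @ gx - gy @ gy)

print("x^T gx == y^T gy:", np.isclose(inner_x, inner_y))
print("drift vs eta^2 term:", drift, second_order)
```

The same cancellation holds for each per-sample gradient whenever the per-sample loss is also a function of $\mathbf{x}\mathbf{y}^\top$, which is the structure assumed here.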

Figures (5)

  • Figure 1: Implicit regularization of SAM on balancedness. The losses for NOP and OP are $\mathbb{E}[ \| \mathbf{x}\mathbf{y}^\top - (\mathbf{A} + \alpha \mathbf{N}) \|^2]$ and $\mathbb{E}[ \| \mathbf{x}^\top \mathbf{y}- (a + \alpha n) \|^2]$, respectively. Here, $\mathbf{A}$ is the ground truth matrix, $\mathbf{N}$ is the Gaussian noise, and $\alpha$ controls the SNR. Left of (a) and (b): $| \| \mathbf{x}_t \|^2 - \| \mathbf{y}_t \|^2 |$ vs. iteration. Right of (a) and (b): $| \| \mathbf{g}_{\mathbf{x}_t} \|^2 - \| \mathbf{g}_{\mathbf{y}_t} \|^2 |$ vs. iteration, where $(\mathbf{g}_{\mathbf{x}_t}, \mathbf{g}_{\mathbf{y}_t})$ denotes stochastic gradients.
  • Figure 2: Implicit regularization of SAM on NOP $\mathbb{E}[ \| \mathbf{x}\mathbf{y}^\top - (\mathbf{A} + \alpha \mathbf{N}) \|^2]$, where $\alpha$ controls SNR. (a) the threshold of balancedness $\bar{\cal B}_t^\rho$ in Corollary \ref{thm.sam-nop-balancing}; (b) implicit vs. explicit regularization.
  • Figure 3: Implicit regularization of SAM on LoRA. We consider few-shot learning with LoRA on RoBERTa-large. For the datasets RTE, SST-5, and MNLI, $2| {\cal B}_{t,l} |$ is plotted for the 1st, 12th, and 24th query layers, respectively. The layers are chosen to represent the early, middle, and final stages of RoBERTa. The averaged $\bar{\cal B}_{t,l}^\rho$ in Corollary \ref{thm.sam-nop-balancing} is $0.37$, $0.21$, and $0.29$, respectively.
  • Figure 4: The value of $f(x,y)$. Once SGD reaches the dotted line, i.e., the hard constraint $|x|=|y|$, it can only converge to a saddle point $(0, 0)$.
  • Figure: SAM (Foret et al., 2021)
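For readers unfamiliar with the SAM update referenced in the last entry, here is a minimal sketch of the standard two-step SAM iteration (ascent to a perturbed point along the normalized gradient, then descent using the gradient at that point). It is applied to the illustrative deterministic loss $\| \mathbf{x}\mathbf{y}^\top - \mathbf{A} \|^2$; the function names `nop_grads` and `sam_step`, the default values of `lr` and `rho`, and the small constant added to the norm are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def nop_grads(x, y, A):
    """Gradients of ||x y^T - A||^2 with respect to x and y."""
    R = np.outer(x, y) - A
    return 2.0 * R @ y, 2.0 * R.T @ x


def sam_step(x, y, A, lr=1e-2, rho=0.05):
    """One SAM-style step: perturb along the normalized gradient, then descend."""
    gx, gy = nop_grads(x, y, A)
    # Joint gradient norm over both variable groups; epsilon avoids division by zero.
    norm = np.sqrt(gx @ gx + gy @ gy) + 1e-12
    # Ascent step of radius rho to the perturbed point.
    x_adv = x + rho * gx / norm
    y_adv = y + rho * gy / norm
    # Descent step from the original point using the perturbed-point gradient.
    gx_adv, gy_adv = nop_grads(x_adv, y_adv, A)
    return x - lr * gx_adv, y - lr * gy_adv


# Hypothetical sizes and data for illustration only.
rng = np.random.default_rng(2)
x = rng.standard_normal(5)
y = rng.standard_normal(4)
A = rng.standard_normal((5, 4))

x_next, y_next = sam_step(x, y, A)
print("updated shapes:", x_next.shape, y_next.shape)
```

Setting `rho=0.0` reduces `sam_step` to a plain gradient step, which gives a convenient baseline when comparing how $\|\mathbf{x}_t\|^2 - \|\mathbf{y}_t\|^2$ evolves under the two methods.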

Theorems & Definitions (23)

  • Theorem 1 (arora2018; arora2018convergence; ji2018gradient; ahn2024learning)
  • Theorem 2
  • Corollary 1
  • Theorem 3
  • Lemma 1
  • proof
  • Theorem 4
  • proof
  • proof
  • Corollary 2
  • ...and 13 more