Benign Overfitting in Single-Head Attention

Roey Magen, Shuning Shang, Zhiwei Xu, Spencer Frei, Wei Hu, Gal Vardi

TL;DR

The paper investigates benign overfitting for a single-head softmax attention mechanism in a high-dimensional, noisy-label setting. It shows that gradient descent on the logistic loss interpolates the training data after two iterations when the signal-to-noise ratio (SNR) is $\Omega(1/\sqrt{n})$, while maintaining near-optimal generalization. It extends the result to min-norm / max-margin interpolators, establishing similar benign behavior under the same SNR condition, and demonstrates the tightness of the SNR threshold in two-token scenarios. Complementary experiments validate the theoretical predictions, including attention allocation that favors signal tokens for clean examples and noise tokens for noisy examples. This work provides a foundational step toward understanding overfitting in the attention mechanisms central to Transformers and suggests directions for more complex architectures and training dynamics.

Abstract

The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy training data but still achieves near-optimal test performance, has been extensively studied in recent years for linear models and fully-connected/convolutional networks. In this work, we study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers. We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting already after two steps of gradient descent. Moreover, we show conditions where a minimum-norm/maximum-margin interpolator exhibits benign overfitting. We study how the overfitting behavior depends on the signal-to-noise ratio (SNR) of the data distribution, namely, the ratio between norms of signal and noise tokens, and prove that a sufficiently large SNR is both necessary and sufficient for benign overfitting.
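To make the model concrete, here is a minimal sketch of a single-head softmax attention forward pass. It assumes the parametrization $f(\bm{X}; \bm{p}, \bm{v}) = \bm{v}^{\top}\bm{X}^{\top}\mathbb{S}(\bm{X}\bm{p})$ with a trainable query vector $\bm{p}$ and a linear head $\bm{v}$; this form is not stated on this page and is an assumption, as are the helper names `softmax`, `dot`, and `attention_output` and the toy tokens.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def attention_output(X, p, v):
    """Assumed model: f(X; p, v) = v^T X^T softmax(X p).

    X is a list of T token vectors (each of length d), p is the query
    vector, v is the linear head. Returns (output, attention weights).
    """
    probs = softmax([dot(x, p) for x in X])  # one attention weight per token
    # v^T X^T s equals sum_t s_t * <v, x_t>
    output = sum(s * dot(v, x) for s, x in zip(probs, X))
    return output, probs

# Hypothetical two-token input: the first token plays the role of the
# signal token, the second the role of the noise token.
X = [[1.0, 0.0, 0.0], [0.0, 2.0, -1.0]]
v = [1.0, -0.5, 0.0]

# With p = 0 the attention is uniform over the two tokens.
out_uniform, probs_uniform = attention_output(X, [0.0, 0.0, 0.0], v)
# A query aligned with the first token concentrates attention on it.
out_focused, probs_focused = attention_output(X, [5.0, 0.0, 0.0], v)
```

The sign of the output is the predicted label in a binary classification setting, so shifting attention between the two tokens can flip the prediction. That is the mechanism the experiments track through the softmax probability of the signal token.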

Paper Structure

This paper contains 30 sections, 55 theorems, 469 equations, 9 figures, 1 table.

Key Result

Theorem 5

Suppose that Assumption \ref{assumption: gd} holds. Then, with probability at least $1-\delta$ over the training dataset, after two iterations of GD we have:

Figures (9)

  • Figure 1: The left panel shows the train and test accuracies during training. It shows that benign overfitting occurs after $2$ iterations. After the first iteration, the model correctly classifies the clean training examples, but not the noisy ones. In the right panel, we show the softmax probability of the signal token for clean and noisy samples (average of the softmax probabilities $s_{j,1}^{t}$ over $\mathcal{C}$ and $\mathcal{N}$ respectively). We see that after $2$ iterations, the attention focuses on signal tokens for clean examples, and on noise tokens for noisy examples. This aligns with Theorem \ref{thm: gd-after-2-iteration} and Remark \ref{remark: sm_probability_c_rho_is_const}. Parameters: $n=200, d=40000, T=2, \beta = 0.025, \rho = 30, \eta = 0.05, \text{test sample size} = 2000$.
  • Figure 2: A heatmap of the test accuracy (averaged over $5$ runs) after achieving training accuracy $100\%$, plotted across varying signal-to-noise ratios (SNR) and sample sizes ($n$). Yellow indicates low test accuracy, while blue represents high test accuracy. The red curves represent the expression $\text{SNR}^2 = 2.1/n$. This validates our tight bound of $\text{SNR} = \Omega(1/\sqrt{n})$ for benign overfitting; with a smaller SNR the model exhibits harmful overfitting. Parameters: $d=900, T=5, \beta = 0.015, \eta = 0.1, \text{test sample size} = 2000$.
  • Figure 3: The left panel shows train and test accuracies during training with a small step size. The clean training samples are correctly classified already after one iteration, but in contrast to Theorem \ref{thm: gd-after-2-iteration} and Figure \ref{fig:gd_two_steps}, benign overfitting occurs after about $150$ iterations. In the right panel, we see that the attention starts separating signal and noise tokens shortly before benign overfitting occurs. Parameters: $n=200, d=40000, T=2, \beta = 0.0001, \rho = 30, \eta = 0.05, \text{test sample size} = 2000$.
  • Figure 4: Comparing train (solid lines) and test (dashed lines) accuracies with different dimensions. Here, we see that for small $d$ (purple line), the model is unable to fit the data (at least in the first $10^5$ iterations), and both the train and test accuracies are at the noise-rate level. For intermediate values of $d$ (green and blue lines), the model exhibits harmful overfitting, and for larger $d$ (yellow line) the model exhibits benign overfitting. We note that benign overfitting occurs here for $d=2n \ll n^2$, which suggests that the assumptions on $d$ in our theorems are loose. Parameters: $n=500, \beta = 0.02, T=5, \rho = 30, \eta = 0.1, \text{test sample size} = 10000$.
  • Figure 5: Self-attention experiments. The model: $\bm{X} \rightarrow \bm{v}^{\top}\bm{X}^{\top} \mathbb{S} (\bm{X}\bm{W}\bm{x}^{(1)})$, the same as in vasudeva2024implicit. The left panel shows the train and test accuracies during training. It shows that benign overfitting also occurs after $2$ iterations. In the right panel, we show the softmax probability of the signal token for clean and noisy samples (average of the softmax probabilities $s_{j,1}^{t}$ over $\mathcal{C}$ and $\mathcal{N}$ respectively). We see that after $2$ iterations, the attention focuses on signal tokens for clean examples, and on noise tokens for noisy examples. This indicates that our results also capture the behavior in a self-attention mechanism. Parameters: $n=200, d=40000, T=2, \beta = 0.025, \rho = 20, \eta = 0.05, \text{test sample size} = 2000$.
  • ...and 4 more figures
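
The empirical curve $\text{SNR}^2 = 2.1/n$ from Figure 2 can be turned into a quick arithmetic check. The sketch below is illustrative only. It assumes that $\rho$ is the norm of the signal token and that noise tokens have norm of about $\sqrt{d}$ (as for standard Gaussian noise); neither assumption is stated on this page. The constant $2.1$ was fit for the Figure 2 setting, so applying it to the Figure 1 parameters is a rough comparison, not a claim from the paper. The function names are hypothetical.

```python
import math

def snr(signal_norm, noise_norm):
    """SNR as the ratio between the norms of signal and noise tokens."""
    return signal_norm / noise_norm

def above_empirical_curve(snr_value, n, c=2.1):
    """True when SNR^2 exceeds c / n (c = 2.1 is the red curve in Figure 2)."""
    return snr_value ** 2 > c / n

# Figure 1 parameters: n = 200, d = 40000, rho = 30.
# ASSUMPTION: rho is the signal-token norm and the noise-token norm is
# approximately sqrt(d); neither is confirmed by the text on this page.
n, d, rho = 200, 40000, 30.0
snr_fig1 = snr(rho, math.sqrt(d))            # 30 / 200 = 0.15
benign_side = above_empirical_curve(snr_fig1, n)

# A smaller, hypothetical SNR at the same sample size falls below the curve.
harmful_side = not above_empirical_curve(0.05, n)
```

Under these assumptions the Figure 1 configuration has $\text{SNR}^2 = 0.0225$ against a threshold of $2.1/200 = 0.0105$, which places it on the benign side of the curve.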

Theorems & Definitions (133)

  • Definition 1: clean data distribution
  • Definition 2: noisy data distribution
  • Remark 4: random initialization
  • Theorem 5
  • Remark 6
  • Theorem 8
  • Remark 9
  • Theorem 10
  • Theorem 12
  • Remark 13
  • ...and 123 more