Benign Overfitting in Token Selection of Attention Mechanism

Keitaro Sakamoto; Issei Sato

TL;DR

This work analyzes how token selection in a one-layer attention mechanism behaves under label noise, revealing two regimes governed by the signal-to-noise ratio ${\rm SNR}$. By fixing the pretrained head and optimizing only the attention-related weights, the authors show that when ${\rm SNR}$ is high the model does not overfit, because attention disregards the noisy samples, whereas when ${\rm SNR}$ is low it exhibits benign overfitting: it memorizes the label noise while still learning the class signal, and generalization improves only after an exponentially long training phase, akin to grokking. The theory rests on a two-stage analytical framework centered on the attention gaps $\Lambda_{i,t}(\tau)$ and $\Gamma_{i,u}(\tau)$ and the growth function $g(x)=2x+2\sinh(x-\log T)$, and it is supported by synthetic experiments and real-world tests using ViT/BERT backbones with limited fine-tuning. The results illuminate the dynamics of token selection under label noise and offer guidance for parameter-efficient tuning strategies, with implications for robust downstream tasks and prompt-based learning.
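The growth function quoted above can be explored numerically. The following is a minimal sketch, not the authors' code: it assumes $T$ is the sequence length (as in the notation ${\bm{x}}_1^{(i)}, \ldots, {\bm{x}}_T^{(i)}$), and the function names and the value `T = 8` are hypothetical choices made only for illustration. It checks the elementary identity $g(x) = 2x + e^{x}/T - T e^{-x}$, which follows from writing out $\sinh$, and the slope $g'(x) = 2 + 2\cosh(x - \log T) \ge 4$.

```python
import math


def growth(x: float, T: int) -> float:
    """g(x) = 2x + 2*sinh(x - log T), the growth function quoted above."""
    return 2.0 * x + 2.0 * math.sinh(x - math.log(T))


def growth_expanded(x: float, T: int) -> float:
    """Same function with sinh written out: 2x + e^x / T - T * e^(-x)."""
    return 2.0 * x + math.exp(x) / T - T * math.exp(-x)


def growth_slope(x: float, T: int) -> float:
    """g'(x) = 2 + 2*cosh(x - log T), which is at least 4 for every x."""
    return 2.0 + 2.0 * math.cosh(x - math.log(T))


if __name__ == "__main__":
    T = 8  # hypothetical sequence length, chosen only for illustration
    for x in (0.0, math.log(T), 5.0, 10.0):
        print(f"x={x:6.3f}  g(x)={growth(x, T):12.4f}  g'(x)={growth_slope(x, T):12.4f}")
```

For large $x$ the term $e^{x}/T$ dominates, so $g$ grows exponentially. This may be where the exponential time scale mentioned above enters, although the summary itself does not spell out that connection.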

Abstract

The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise. We show that, under a characterization in terms of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves benign overfitting, i.e., it maintains high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments on both synthetic and real-world datasets to support our theoretical analysis.

Paper Structure (85 sections, 36 theorems, 302 equations, 13 figures, 6 tables)

Introduction
Contribution.
Related Work
Token-selection in Attention.
Benign Overfitting.
Problem Setting
Notations
Data Model
Attention Model
Gradient-descent Training
Assumption on Parameters
Main Results
Key Techniques
Experiments
Synthetic experiments.
...and 70 more sections

Key Result

Theorem 4.1

Suppose that the norm of the linear head scales as $\|{\bm{\nu}}\|_2 = O(1 / \|{\bm{\mu}}\|_2)$. Under the parameter assumptions in the section "Assumption on Parameters", we have

Figures (13)

Figure 1: Projection of one selected token per sequence. Each point indicates ${\bm{x}}_t^{(i)}$ selected by attention from each input ${\bm{X}}^{(i)} = ({\bm{x}}_1^{(i)}, \ldots, {\bm{x}}_T^{(i)})^\top$ in the direction of class signals ${\bm{\mu}}_{+1}$ and ${\bm{\mu}}_{-1}$ for the three scenarios of harmful overfitting, benign overfitting, and not overfitting. Top: Training data with label noise. Bottom: Test data. The decision boundary is common because the head ${\bm{\nu}}$ is fixed, but the model can select an appropriate token that belongs to the desired output region. Here, $\|{\bm{\mu}}\|_2$ denotes the strength of the class signal, $d$ is the dimension of the data and parameters, $\sigma_\epsilon^2$ is the variance of the input noise, and $n$ is the size of the training set.
Figure 2: Illustration of the training dynamics of the probability $s_1$ assigned to relevant tokens in clean data $i \in {\mathcal{C}}$ and noisy data $j \in {\mathcal{N}}$. The y-axis shows $s_1(\tau)(1-s_1(\tau))$, which determines the magnitude of the gradient descent update. This value converges to $0$ as $s_1(\tau)$ approaches $0$ or $1$, and consequently, the contribution of this training example to the gradient update diminishes. The middle panel corresponds to the not-overfitting case in Theorem 4.1, and the right panel represents the benign overfitting case.
Figure 3: Large noise setting: $d = 5000$, $\|{\bm{\mu}}\|_2 = 5$. Final training accuracy is $\mathbf{1.0}$ and test accuracy is $\mathbf{0.87}$ (Harmful overfitting).
Figure 4: Balanced setting: $d = 2000$, $\|{\bm{\mu}}\|_2 = 20$. Final training accuracy is $\mathbf{1.0}$ and test accuracy is $\mathbf{1.0}$ (Benign overfitting).
Figure 5: Large signal setting: $d = 1000$, $\|{\bm{\mu}}\|_2 = 100$. Final training accuracy is $\mathbf{0.8}$ and test accuracy is $\mathbf{1.0}$ (Not overfitting).
...and 8 more figures
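The quantity $s_1(\tau)(1-s_1(\tau))$ from the Figure 2 caption can be illustrated with a plain softmax. The sketch below is an assumption-laden toy, not the paper's model: it supposes a single relevant token whose score exceeds the other $T-1$ scores by a hypothetical `gap`, with all other scores equal, and the helper names are invented for this example. Under that simplification, the derivative of $s_1$ with respect to the gap equals $s_1(1-s_1)$, which peaks at $0.25$ when $s_1 = 1/2$ and vanishes as $s_1$ approaches $0$ or $1$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevant_token_prob(gap: float, T: int) -> float:
    """s_1 when token 1 scores `gap` above the other T-1 tokens (toy assumption)."""
    return softmax([gap] + [0.0] * (T - 1))[0]


def gradient_factor(gap: float, T: int) -> float:
    """s_1 * (1 - s_1), the factor shown on the y-axis of Figure 2."""
    s1 = relevant_token_prob(gap, T)
    return s1 * (1.0 - s1)


if __name__ == "__main__":
    T = 8  # hypothetical number of tokens
    for gap in (-10.0, 0.0, math.log(T - 1), 5.0, 20.0):
        s1 = relevant_token_prob(gap, T)
        print(f"gap={gap:7.3f}  s1={s1:.6f}  s1*(1-s1)={gradient_factor(gap, T):.6e}")
```

Once attention has committed to a token in either direction, the factor is close to zero, so that example barely moves the attention weights any further.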

Theorems & Definitions (78)

Definition 3.1
Remark 3.2: Weakly relevant token and label noise
Remark 3.3: Relevance to practical scenarios
Theorem 4.1
Remark 4.2: Harmful Overfitting
Remark 4.3: Implication for Grokking
Definition 4.4: Attention gap
Lemma 4.5: Attention gap dynamics of clean data
Lemma 4.6: Attention gap dynamics of noisy data in Stage 1
Lemma 4.7: Attention gap dynamics of noisy data in Stage 2
...and 68 more
