Benign Overfitting in Token Selection of Attention Mechanism
Keitaro Sakamoto, Issei Sato
TL;DR
This work analyzes how token selection in a one-layer attention mechanism behaves under label noise, revealing two regimes governed by the signal-to-noise ratio ${\rm SNR}$. With the pretrained head fixed and only the attention-related weights optimized, the authors show that a high ${\rm SNR}$ avoids overfitting because the model disregards noisy samples, while a low ${\rm SNR}$ yields benign overfitting: noise memorization coexists with learning the class signal, and generalization improves only after an exponential-time phase akin to grokking. The theory rests on a two-stage analytical framework centered on the attention gaps $\Lambda_{i,t}(\tau)$ and $\Gamma_{i,u}(\tau)$ and the growth function $g(x)=2x+2\sinh(x-\log T)$. It is supported by synthetic experiments and by real-world tests using ViT/BERT backbones with limited fine-tuning. The results illuminate the dynamics of token selection under label noise and offer guidance for parameter-efficient tuning strategies, with implications for robust downstream tasks and prompt-based learning.
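To make the growth function $g(x)=2x+2\sinh(x-\log T)$ concrete, the following is a minimal Python sketch that evaluates it and checks two elementary facts that follow from the formula alone: the expanded form $g(x)=2x+e^{x}/T-Te^{-x}$, and strict monotonicity, since $g'(x)=2+2\cosh(x-\log T)>0$. The function names are hypothetical, treating $T$ as a positive integer (for example a sequence length) is an assumption not stated in this summary, and the bisection inverse is an illustrative helper rather than part of the authors' analysis.

```python
import math


def growth(x: float, T: int) -> float:
    """Evaluate g(x) = 2x + 2*sinh(x - log T) as written in the summary."""
    return 2.0 * x + 2.0 * math.sinh(x - math.log(T))


def growth_expanded(x: float, T: int) -> float:
    """Same function with sinh expanded: 2x + exp(x)/T - T*exp(-x)."""
    return 2.0 * x + math.exp(x) / T - T * math.exp(-x)


def growth_inverse(y: float, T: int, lo: float = -50.0, hi: float = 50.0,
                   iters: int = 200) -> float:
    """Illustrative helper (not from the paper): invert g by bisection.

    Bisection is valid because g is strictly increasing. The bracket
    [lo, hi] is an arbitrary assumption and must contain the solution.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if growth(mid, T) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    T = 8  # assumed value, chosen only for illustration
    for x in (0.0, math.log(T), 3.0):
        print(f"x={x:.3f}  g(x)={growth(x, T):.6f}")
```

At $x=\log T$ the hyperbolic term vanishes, so $g(\log T)=2\log T$. For $x$ well above $\log T$ the $e^{x}/T$ term dominates, so $g$ grows exponentially in $x$.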
Abstract
The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise. We show that, under a characterization in terms of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves benign overfitting, i.e., it maintains high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments on both synthetic and real-world datasets to support our theoretical analysis.
