Attention layers provably solve single-location regression
Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer
TL;DR
This work studies attention mechanisms under token sparsity by formulating single-location regression, a task in which the output depends on a single token whose position in the sequence is latent. It introduces a dedicated predictor that mirrors a simplified, nonlinear self-attention layer, proves its asymptotic Bayes optimality, and analyzes its non-convex training dynamics under projected gradient descent. The results show that the oracle predictor attains Bayes-optimal performance in a high-dimensional regime, whereas linear predictors fail when the relevant location is latent, which highlights a distinct advantage of attention-like architectures. The findings illuminate how Transformers can store and use sparse token information through internal linear representations, with implications for interpretability and for extensions to more complex sparse-sequence tasks in NLP and time-series analysis.
Abstract
Attention-based models, such as Transformers, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.
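As a rough illustration of the task described above, the following NumPy sketch simulates one possible instance of single-location regression and compares an attention-like predictor with a linear baseline. Everything specific here is an assumption made for illustration and is not taken from the abstract: Gaussian tokens, a mean shift of the relevant token along a direction `k_star`, an output direction `v_star` orthogonal to `k_star`, a softmax nonlinearity, and all numerical values (`shift`, `noise`, `temp`, the dimensions). The predictor is evaluated with oracle parameters, so no training dynamics are shown.

```python
import numpy as np

def sample_task(n, L, d, k_star, v_star, shift=6.0, noise=0.1, rng=None):
    """Draw n sequences of L tokens in R^d (hypothetical data model).

    One token per sequence, at latent position J, is shifted along k_star,
    so J is retrievable from the linear projection X @ k_star.
    The output depends only on that token: y = <X_J, v_star> + noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, L, d))
    J = rng.integers(0, L, size=n)            # latent relevant position
    rows = np.arange(n)
    X[rows, J] += shift * k_star              # mark the relevant token
    y = X[rows, J] @ v_star + noise * rng.standard_normal(n)
    return X, J, y

def attention_predict(X, k, v, temp=3.0):
    """Simplified attention-like predictor: softmax(temp * X k) weights on X v."""
    scores = temp * (X @ k)                           # shape (n, L)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.sum(weights * (X @ v), axis=1)          # shape (n,)

rng = np.random.default_rng(0)
n, L, d = 2000, 8, 16
k_star = np.zeros(d); k_star[0] = 1.0   # assumed location direction
v_star = np.zeros(d); v_star[1] = 1.0   # assumed output direction, orthogonal to k_star

X_train, _, y_train = sample_task(n, L, d, k_star, v_star, rng=rng)
X_test, _, y_test = sample_task(n, L, d, k_star, v_star, rng=rng)

# Attention-like predictor evaluated at the oracle parameters (k_star, v_star).
mse_attn = np.mean((attention_predict(X_test, k_star, v_star) - y_test) ** 2)

# Linear baseline: least squares on the flattened sequence.
A_train = X_train.reshape(n, L * d)
A_test = X_test.reshape(n, L * d)
coef = np.linalg.lstsq(A_train, y_train, rcond=None)[0]
mse_lin = np.mean((A_test @ coef - y_test) ** 2)

print(f"attention-like (oracle) test MSE: {mse_attn:.3f}")
print(f"linear least-squares test MSE:    {mse_lin:.3f}")
```

In this toy setting the linear baseline cannot tell which token carries the signal, since the relevant position changes from sample to sample, whereas the attention-like predictor first locates the token through the projection on `k_star` and then reads it out along `v_star`.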
