Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi
TL;DR
This work studies the problem of learning pairwise interactions in attention-style models by viewing tokens as particles in an interacting particle system (IPS). Under a coercivity condition, it proves a dimension-free minimax rate of $M^{-2\beta/(2\beta+1)}$, where $M$ is the sample size. The rate depends only on the Hölder smoothness $\beta$ of the activation and is independent of the embedding dimension $d$, the number of tokens $N$, and the weight rank $r$. A matching lower bound confirms optimality (up to logarithmic factors), and numerical experiments verify the dimension-free behavior and the dependence on $\beta$. The results offer theoretical insight into the statistical efficiency of attention mechanisms and motivate extensions to multi-head and more complex transformer components.
Abstract
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size. The rate depends only on the smoothness $\beta$ of the activation and, crucially, is independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and they provide a theoretical understanding of the attention mechanism and its training.
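
To make the stated rate concrete, the following minimal Python sketch evaluates the exponent $\frac{2\beta}{2\beta+1}$ and the resulting rate $M^{-\frac{2\beta}{2\beta+1}}$ for a few smoothness values. It is only an arithmetic illustration: the function names are hypothetical, and constants and possible logarithmic factors are ignored. It does not implement the estimator or the experiments of the paper.

```python
def rate_exponent(beta: float) -> float:
    """Exponent 2*beta / (2*beta + 1) of the rate M^(-2*beta/(2*beta+1))."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return 2.0 * beta / (2.0 * beta + 1.0)


def minimax_rate(M: int, beta: float) -> float:
    """Rate M^(-2*beta/(2*beta+1)); constants and log factors are ignored."""
    if M <= 0:
        raise ValueError("M must be positive")
    return float(M) ** (-rate_exponent(beta))


def sample_factor_to_halve_rate(beta: float) -> float:
    """Factor by which M must grow for the rate to drop by one half.

    Solves (c * M)^(-e) = 0.5 * M^(-e) for c, giving c = 2^(1/e).
    """
    return 2.0 ** (1.0 / rate_exponent(beta))


if __name__ == "__main__":
    # The rate has no argument for d, N, or r: only M and beta enter.
    for beta in (0.5, 1.0, 2.0):
        print(
            f"beta={beta}: exponent={rate_exponent(beta):.3f}, "
            f"rate at M=10^4: {minimax_rate(10_000, beta):.2e}, "
            f"M factor to halve the rate: {sample_factor_to_halve_rate(beta):.2f}"
        )
```

As $\beta$ grows, the exponent approaches 1, so smoother activations bring the rate closer to the parametric rate $M^{-1}$, while the embedding dimension, token count, and rank never appear in the formula.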
