Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Shai Zucker; Xiong Wang; Fei Lu; Inbar Seroussi

TL;DR

This work studies learning pairwise interactions in attention-style models by treating tokens as particles in an interacting particle system (IPS). It proves a dimension-free minimax rate of $M^{-2\beta/(2\beta+1)}$ in the sample size $M$. Under a coercivity condition, the rate depends only on the Hölder smoothness $\beta$ of the activation and is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix. A matching lower bound confirms optimality (up to logarithmic factors), and numerical experiments verify the dimension-free behavior and the dependence on $\beta$. The results offer theoretical insight into the statistical efficiency of attention mechanisms and motivate extensions to multi-head attention and more complex transformer components.

Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size. The rate depends only on the smoothness $\beta$ of the activation and is crucially independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.
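To make the stated rate concrete, the following is a minimal sketch that evaluates the exponent $2\beta/(2\beta+1)$ and the resulting value of $M^{-2\beta/(2\beta+1)}$ for a few smoothness levels. The function names are hypothetical, constants and logarithmic factors are ignored, and the code only illustrates the arithmetic of the rate; it is not the paper's estimator or experiments.

```python
def minimax_exponent(beta: float) -> float:
    """Exponent 2*beta / (2*beta + 1) of the rate M^(-2*beta/(2*beta+1))."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return 2.0 * beta / (2.0 * beta + 1.0)


def rate(num_samples: int, beta: float) -> float:
    """Value of M^(-2*beta/(2*beta+1)), with constants and log factors dropped."""
    return num_samples ** (-minimax_exponent(beta))


def sample_growth_to_halve(beta: float) -> float:
    """Factor by which M must grow for the rate to shrink by one half."""
    return 2.0 ** (1.0 / minimax_exponent(beta))


if __name__ == "__main__":
    # The formula takes only M and beta as inputs: the embedding dimension d,
    # the token count N, and the rank r do not appear anywhere.
    for beta in (0.5, 1.0, 2.0):
        print(
            f"beta={beta}: exponent={minimax_exponent(beta):.3f}, "
            f"rate at M=10000: {rate(10_000, beta):.2e}, "
            f"M growth to halve the rate: {sample_growth_to_halve(beta):.2f}x"
        )
```

As $\beta$ grows, the exponent approaches 1, so smoother activations bring the rate closer to $M^{-1}$, while rougher activations require more samples for the same reduction.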
