Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi

TL;DR

This work studies the problem of learning pairwise interactions in attention-style models by treating tokens as particles in an interacting particle system (IPS). It proves a dimension-free minimax rate of $M^{-2\beta/(2\beta+1)}$ in the sample size $M$, depending only on the Hölder smoothness $\beta$ of the activation and independent of the embedding dimension $d$, the number of tokens $N$, and the weight rank $r$, under a coercivity condition. A matching lower bound confirms optimality (up to logarithmic factors), and numerical experiments verify the dimension-free behavior and the dependence on $\beta$. The results offer theoretical insight into the statistical efficiency of attention mechanisms and motivate extensions to multi-head and more complex transformer components.

Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, with $M$ being the sample size, depending only on the smoothness $\beta$ of the activation, and crucially independent of token count, ambient dimension, or rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.
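To make the rate in the abstract concrete, the following minimal sketch evaluates $M^{-2\beta/(2\beta+1)}$ for a few smoothness values. The function name `minimax_rate` and the chosen values of $M$ and $\beta$ are illustrative assumptions, not taken from the paper, and all constants and logarithmic factors are omitted.

```python
def minimax_rate(M: int, beta: float) -> float:
    """Evaluate M^(-2*beta / (2*beta + 1)), the rate stated in the abstract.

    Hypothetical helper for illustration: constants and log factors are
    dropped, so only the scaling in M and beta is meaningful.
    """
    if M <= 1:
        raise ValueError("M must be greater than 1")
    if beta <= 0:
        raise ValueError("beta must be positive")
    exponent = 2.0 * beta / (2.0 * beta + 1.0)
    return M ** (-exponent)


if __name__ == "__main__":
    # Illustrative values only. Note that d, N, and r do not appear:
    # the rate depends on the sample size M and the smoothness beta alone.
    for beta in (0.5, 1.0, 2.0):
        exponent = 2.0 * beta / (2.0 * beta + 1.0)
        for M in (10**3, 10**4, 10**5):
            print(f"beta={beta:<4} exponent={exponent:.3f} "
                  f"M={M:<7} rate={minimax_rate(M, beta):.3e}")
```

As $\beta$ grows, the exponent $2\beta/(2\beta+1)$ approaches 1, so smoother activations give rates closer to the parametric $M^{-1}$.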

Paper Structure

This paper contains 33 sections, 10 theorems, 120 equations, 3 figures, 1 table.

Key Result

Theorem 3.1

Suppose $rd \le (M/\log M)^{\frac{1}{2\beta +1}}$. Consider the estimator $\widehat{g}_{M}$ defined in eq:g_est1, computed on $M$ i.i.d. observations satisfying Assumptions assumption:data_dist and assumption:noise_subG. Then it holds that where $C_{N,L,\bar{a},\beta,s}=N[C_1^\beta \frac{L^2(s \bar{a})^{2\beta}}{(s !)^2}+C_2]$ for some universal pos
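The hypothesis $rd \le (M/\log M)^{\frac{1}{2\beta+1}}$ can be checked numerically for a given configuration. The sketch below is illustrative only: the function names are hypothetical, the logarithm is assumed to be the natural logarithm, and the example values of $r$, $d$, $M$, and $\beta$ are not from the paper.

```python
import math


def max_rank_times_dim(M: int, beta: float) -> float:
    """Right-hand side (M / log M)^(1 / (2*beta + 1)) of the theorem's hypothesis.

    Assumption: log is the natural logarithm.
    """
    if M <= 1:
        raise ValueError("M must be greater than 1 so that log M > 0")
    if beta <= 0:
        raise ValueError("beta must be positive")
    return (M / math.log(M)) ** (1.0 / (2.0 * beta + 1.0))


def condition_holds(r: int, d: int, M: int, beta: float) -> bool:
    """Return True when r * d <= (M / log M)^(1 / (2*beta + 1))."""
    return r * d <= max_rank_times_dim(M, beta)


if __name__ == "__main__":
    # Hypothetical configuration: M = 10^6 samples, smoothness beta = 1.
    M, beta = 10**6, 1.0
    print(f"threshold on r*d: {max_rank_times_dim(M, beta):.2f}")
    for r, d in ((4, 10), (8, 8)):
        print(f"r={r}, d={d}, r*d={r * d}: {condition_holds(r, d, M, beta)}")
```

Under these assumptions the admissible product $rd$ grows with $M$, more slowly for larger $\beta$.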

Figures (3)

  • Figure : (a)
  • Figure : (a)
  • Figure : (b)

Theorems & Definitions (18)

  • Definition 2.1: Exploration measure
  • Definition 2.2: Hölder classes
  • Definition 2.3: Interaction matrix class
  • Definition 2.4: Target function class
  • Definition 2.5: Estimator function class
  • Theorem 3.1
  • Remark 3.2
  • Remark 3.3
  • Lemma 3.4: Coercivity
  • Lemma 4.1
  • ...and 8 more