Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová

TL;DR

A principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location, and an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters.

Abstract

Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.
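To make the task concrete, here is a minimal NumPy sketch of one possible single-location regression setup together with a one-layer attention predictor. It is an illustration under stated assumptions, not the authors' code: the spiked data model (the relevant token is shifted by $\nu k^*$), the predictor form $\sum_\ell \sigma(\langle x_\ell, k\rangle)_\ell \langle x_\ell, v\rangle$, the function names, and all numerical values are hypothetical choices made here; only the symbols $k$, $v$, $k^*$, $v^*$, $\nu$, $L$ and the uniform hidden location are taken from the text.

```python
import numpy as np


def sample_spiked_slr(n, L, d, nu, k_star, v_star, rng):
    """Draw n sequences of L Gaussian tokens in R^d (assumed data model)."""
    X = rng.standard_normal((n, L, d))
    eps = rng.integers(0, L, size=n)   # hidden location, uniform over L positions
    rows = np.arange(n)
    X[rows, eps] += nu * k_star        # assumption: relevant token is spiked along k_star
    y = X[rows, eps] @ v_star          # label reads out only the relevant token
    return X, y, eps


def attention_weights(X, k, activation):
    """Per-token weights from the scores <x_l, k>; shape (n, L)."""
    scores = X @ k
    if activation == "softmax":
        shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(shifted)
        return w / w.sum(axis=1, keepdims=True)
    if activation == "linear":
        return scores                  # linear attention: no normalization
    raise ValueError(f"unknown activation: {activation}")


def attention_predict(X, k, v, activation):
    """Weighted sum over tokens of the linear readouts <x_l, v>."""
    return (attention_weights(X, k, activation) * (X @ v)).sum(axis=1)


# Small demo with hypothetical sizes and signal strength.
rng = np.random.default_rng(0)
n, L, d, nu = 2000, 5, 50, 10.0
k_star = np.zeros(d)
k_star[0] = 1.0
v_star = np.zeros(d)
v_star[1] = 1.0
X, y, eps = sample_spiked_slr(n, L, d, nu, k_star, v_star, rng)

# Softmax attention with k aligned to k_star (scaled up) and v = v_star.
w_soft = attention_weights(X, 5.0 * k_star, "softmax")
pred_soft = attention_predict(X, 5.0 * k_star, v_star, "softmax")
mse_soft = float(np.mean((pred_soft - y) ** 2))
print("softmax MSE with aligned parameters:", mse_soft)
```

In this strong-signal toy setting, the softmax weights concentrate on the hidden location, so the predictor nearly copies the readout of the relevant token. Passing `"linear"` to the same functions gives the unnormalized variant for comparison; the sketch makes no attempt to reproduce the paper's optimal risks, which are obtained by optimizing over $k$ and $v$.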

Paper Structure

This paper contains 45 sections, 8 theorems, 112 equations, 7 figures, 1 table.

Key Result

Proposition 4.1

Let $L\sim P_L$ and, conditionally on $L$, $\epsilon\sim\mathop{\mathrm{Unif}}\nolimits(\{1,\ldots,L\})$ and $\chi\sim\mathcal{N}(0,I_L)$. Then the Bayes risk is given by

Figures (7)

  • Figure 1 (from shen2024scaling): Comparison of softmax Transformer (LLaMA, bolded line) with kernelized attention (TNL, cos2) and state-space models (HGRN2), as a function of the model size, and for various tasks (retrieval tasks on the left, linguistic proficiency on the right). All architectures have similar performance for the linguistic proficiency tasks, whereas in retrieval tasks the softmax attention systematically outperforms alternatives.
  • Figure 2: Minimal population risk $\mathsf{E}_\sigma$ over $\mathcal{F}_\sigma$ for different attention activations $\sigma$ (colors), compared to the Bayes risk $\mathcal{E}_\mathrm{Bayes}$ (\ref{eq:mseBoPop}) (black), for the two tasks spiked-SLR (top) and max-SLR (bottom). Softmax is the only one achieving the Bayes risk. The markers on the lines are for readability only. Population risks are computed via numerical optimization of (\ref{eq:risk_manifold}). In all cases, we found that $R_{kk} = R_{vv} = 0$ was optimal, i.e. $k$ exactly aligns with $k^*$ and $v$ with $v^*$.
  • Figure 3: Minimal test risk of the attention (linear vs. softmax) across different tasks and signal strengths $\nu$, for $L=3$. Linear attention is shown in red and softmax in blue. Solid lines indicate $\mathsf{E}_\sigma(\alpha)$ at finite $\alpha$ (Result \ref{res:mseAtt}), while markers represent the test risk of an ERM obtained via a local optimization method with $\sqrt{ND} = 10^4$. The regularizations $r_k$ and $r_v$ are tuned by grid search to minimize the test risk, as detailed in Appendix \ref{secApp:numérique}. Dotted and dashed lines correspond to the value of $\mathsf{E}_\sigma$ in the infinite-$\alpha$ limit (see closed-form formulas in Proposition \ref{res:mseBoPop} for softmax and Appendix \ref{secApp:perteAttLinéaire} for linear). The Bayes-optimal risk is shown in black (see Appendix \ref{secApp:bo} for a discussion of its discontinuity). Appendices \ref{secApp:numérique}-\ref{secApp:figAdd} include more experimental details and results.
  • Figure 4: Values of the order parameters $m_{kv^*}, m_{vk^*}, m_{vv^*}, R_{kk}, R_{kv}$ and $R_{vv}$ obtained after convergence of the gradient flow. The mean, max and min are taken over the independent runs. The initial noise is $\bar{\eta}=0.1$ and there are at least twenty independent runs.
  • Figure 5: Minimal population risk $\mathsf{E}_\sigma$, reached at the well-matched minimum, and population risk of the mismatched minimum ${\beth}_\sigma$, for different attention activations $\sigma$ (colors), for the two tasks spiked-SLR and max-SLR at $L=10$. The markers on the lines are for readability only. Population risks are computed by numerical optimization of $\tilde{\mathcal{E}}_\sigma$ from a random initialization (well-matched minimum) or from the mismatched initialization described in Section \ref{subsec:other-possible-minima} (mismatched minimum).
  • ...and 2 more figures

Theorems & Definitions (9)

  • Remark 4.1
  • Proposition 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Proposition 4.4
  • Corollary 4.1
  • Corollary 4.2
  • Corollary 4.3
  • Corollary 4.4