Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs

Mana Sakai, Ryo Karakida, Masaaki Imaizumi

TL;DR

This work establishes the infinite-width behavior of a single attention layer under realistic $1/\sqrt{n}$-scaling with a finite number of heads, showing that the limit is non-Gaussian: it is a hierarchical Gaussian, driven by random similarity scores. Using the Tensor Programs framework, it derives limiting distributions for both the Netsor-program variables and the dot-product scores, revealing that attention outputs are Gaussian conditional on the random scores, which themselves converge to Gaussians. The results give finite-head attention a precise non-Gaussian limit, and numerical experiments validate the theory at finite widths, including robustness to varying sequence lengths and activation functions. This analysis provides a foundational step toward a unified infinite-width theory of deep Transformer architectures, with potential implications for signal propagation, training dynamics, and feature learning in attention-based models.

Abstract

In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or tailored scaling schemes. In this paper, leveraging the Tensor Programs framework, we rigorously identify the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensionality and standard $1/\sqrt{n}$-scaling, where $n$ is the dimensionality. We derive the exact form of this limit law without resorting to infinite-head approximations or tailored scalings, demonstrating that it departs fundamentally from Gaussianity. The non-Gaussianity of this limiting distribution arises from a hierarchical structure: it is Gaussian conditional on the random similarity scores. Numerical experiments validate our theoretical predictions, confirming that our theory is effective at finite width and accurately describes finite-head attention. Beyond characterizing a standalone attention layer, our findings lay the groundwork for developing a unified theory of deep Transformer architectures in the infinite-width regime.
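The hierarchical structure described in the abstract can be illustrated with a toy two-stage sampler. This is a minimal sketch, not the paper's construction: it assumes a single head, a sequence of length 2, and limiting scores and value coordinates that are independent standard Gaussians. The function names `conditional_variance`, `sample_output`, and `excess_kurtosis` are hypothetical. Given the scores, the output is a fixed linear combination of Gaussians and hence Gaussian. Marginally it is a scale mixture of Gaussians, which has heavier tails than a single Gaussian.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(p - m) for p in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_variance(scores):
    """Variance of the toy output given the scores.

    With independent N(0, 1) values (an assumption of this sketch),
    sum_j a_j * v_j has variance sum_j a_j^2, where a = softmax(scores).
    """
    return sum(a * a for a in softmax(scores))


def sample_output(rng, seq_len=2):
    """Draw one toy attention output coordinate in two stages."""
    # Stage 1: random similarity scores (assumed i.i.d. N(0, 1) here).
    scores = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]
    weights = softmax(scores)
    # Stage 2: given the scores, a weighted sum of Gaussian values.
    values = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]
    return sum(a * v for a, v in zip(weights, values))


def excess_kurtosis(samples):
    """Sample excess kurtosis; it is 0 for an exactly Gaussian law."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2) - 3.0
```

In this sketch the conditional variance depends on the realized scores: it is 0.5 when both scores are equal and approaches 1 when one score dominates. That randomness in the variance is what makes the marginal law non-Gaussian, so `excess_kurtosis` applied to many draws of `sample_output` should come out positive.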

Paper Structure

This paper contains 36 sections, 18 theorems, 107 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Consider a $\textsc{Netsor}$ program, and suppose all nonlinearities used in Nonlin are pseudo-Lipschitz. We adopt the settings and notations from Assumption asmptn:Attention, which defines vectors $g^{1},\dots,g^{m}$ and scalar dot-products $p_{1},\dots,p_{r}$. Further, suppose all initial vectors where each $\varphi^{j}$ is a pseudo-Lipschitz function. Then, for any bounded and pseudo-Lipschitz

Figures (5)

  • Figure 1: Comparison of the distribution of the attention output $y_{1}^{1}$ and its infinite-width limit $Z^{y^{1}}$ in Example 3.1. (a) Kernel density estimates of the empirical distribution (via Monte Carlo sampling) of $y_{1}^{1}$ for widths $n\in\{16,64,256,1024\}$ (dashed lines) alongside that of $Z^{y^{1}}$ (solid line), showing the convergence of the finite-width distribution to its limit. (b) Average of the log-KL divergence $\log\mathrm{KL}(\mathrm{Dist}(y_{1}^{1})\|\mathrm{Dist}(Z^{y^{1}}))$ over 10 independent trials, plotted against $\log_{4}(n)$ with error bars indicating one standard deviation, confirming a decreasing trend.
  • Figure 2: Visualization of the dot-product score $p_{1,1}^{(1)}$ and attention output $y_{1}^{1}$, as defined in Example 3.1, comparing finite-width behavior to their infinite-width limits. (a) Histogram of the empirical distribution of $p_{1,1}^{(1)}$ for $n=256$ alongside the plot of its infinite-width limit distribution $\mathring{p}_{1,1}^{1}$, under two scaling schemes: $1/\sqrt{n}$ and $1/n$. The $1/n$-scaled score collapses to zero in the infinite-width limit, while the $1/\sqrt{n}$-scaled score retains a nondegenerate distribution. (b) Kernel density estimates of the empirical distribution of $y_{1}^{1}$ for $n=256$ (dashed lines) alongside the plot of its infinite-width limit distribution $Z^{y^{1}}$ (solid lines), for head counts $H\in\{1,256\}$. The black solid line represents the density of the infinite-head limit distribution from Hron et al. (2020). This demonstrates that our theoretical prediction remains accurate even when $H$ grows, and that it approaches the infinite-head limit.
  • Figure 3: Comparison of the distribution of the attention output $y_{1}^{1}$ and its infinite-width limit $Z^{y}$ under the low-rank setting. (a) Kernel density estimates of the empirical distribution (via Monte Carlo sampling) of $y_{1}^{1}$ for various widths $n$ and head counts $H$ (with $n_H=n/H=64$ fixed; dashed lines) alongside that of $Z^{y}$ (solid lines). (b) Average of the log-KL divergence $\log\mathrm{KL}(\mathrm{Dist}(y_{1}^{1})\|\mathrm{Dist}(Z^{y^{1}}))$ over 10 independent trials, plotted against $\log_{4}(n)$ with error bars indicating one standard deviation.
  • Figure 4: Comparison of the distribution of the attention output $y_{1}^{1}$ and its infinite-width limit $Z^{y^{1}}$ when $s=8$. (a) Kernel density estimates of the empirical distribution (via Monte Carlo sampling) of $y_{1}^{1}$ for widths $n\in\{16,64,256,1024\}$ (dashed lines) alongside that of $Z^{y^{1}}$ (solid line). (b) Average of the log-KL divergence $\log\mathrm{KL}(\mathrm{Dist}(y_{1}^{1})\|\mathrm{Dist}(Z^{y^{1}}))$ over 10 independent trials, plotted against $\log_{4}(n)$ with error bars indicating one standard deviation.
  • Figure 5: Comparison of the distribution of the attention output $y_{1}^{1}$ and its infinite-width limit $Z^{y}$ with the ReLU activation function. (a) Kernel density estimates of the empirical distribution (via Monte Carlo sampling) of $y_{1}^{1}$ for various widths $n$ and head counts $H$ (with $n_H=n/H=64$ fixed; dashed lines) alongside that of $Z^{y}$ (solid lines). (b) Average of the log-KL divergence $\log\mathrm{KL}(\mathrm{Dist}(y_{1}^{1})\|\mathrm{Dist}(Z^{y^{1}}))$ over 10 independent trials, plotted against $\log_{4}(n)$ with error bars indicating one standard deviation.
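
The contrast between the two scalings in Figure 2(a) can be reproduced in a small Monte Carlo sketch. It is an illustration under stated assumptions, not the paper's experimental code: the query and key coordinates are taken to be i.i.d. $N(0,1)$, as would arise, for instance, from independent Gaussian weight matrices with entry variance $1/n$ applied to inputs of squared norm $n$. The names `sample_score` and `score_std` are hypothetical. Under these assumptions the dot product of a query and a key has variance $n$, so dividing by $\sqrt{n}$ keeps the score's standard deviation at 1, while dividing by $n$ shrinks it to $1/\sqrt{n}$.

```python
import math
import random
import statistics


def sample_score(n, scaling, rng):
    """Draw one scaled dot-product score at width n.

    Assumption of this sketch: query and key coordinates are
    i.i.d. N(0, 1) and independent of each other.
    """
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]
    k = [rng.gauss(0.0, 1.0) for _ in range(n)]
    dot = sum(a * b for a, b in zip(q, k))
    if scaling == "sqrt":      # 1/sqrt(n)-scaling
        return dot / math.sqrt(n)
    if scaling == "linear":    # 1/n-scaling
        return dot / n
    raise ValueError("scaling must be 'sqrt' or 'linear'")


def score_std(n, scaling, trials=2000, seed=0):
    """Empirical standard deviation of the score over independent draws."""
    rng = random.Random(seed)
    draws = [sample_score(n, scaling, rng) for _ in range(trials)]
    return statistics.pstdev(draws)


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, score_std(n, "sqrt"), score_std(n, "linear"))
```

Running the script prints one row per width. The `"sqrt"` column should stay near 1 as `n` grows, while the `"linear"` column should decay roughly like $1/\sqrt{n}$, matching the collapse to zero described in the caption.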

Theorems & Definitions (36)

  • Definition 3.1: Limiting Distribution
  • Theorem 3.1
  • Corollary 3.2: Coordinatewise Convergence
  • Example 3.1: Multi-Head Attention
  • Remark 3.1
  • Remark 3.2
  • Theorem 4.1
  • Lemma C.1
  • proof
  • Lemma C.2: Portmanteau lemma (Lemma 2.2 in van der Vaart, 1998)
  • ...and 26 more