On the Capacity of Self-Attention

Micah Adler

TL;DR

The paper formalizes the capacity of self-attention via Relational Graph Recognition (RGR), modeling edge recovery among $m$ items with $m'$ directed edges under a total key dimension budget $D_K=h d_k$. It derives a tight capacity law, $D_K = \Theta\left(\dfrac{m' \log m'}{d_{\text{model}}}\right)$, showing this is both necessary and sufficient for broad graph families and contexts, and provides explicit constructions and empirical validation. A central insight is a capacity-based rationale for multi-head attention: distributing budget across many small heads mitigates interference when embeddings are compressed, increasing the number of recoverable relations even when each item attends to a single target. The work combines information‑theoretic lower bounds with constructive algorithms under Gaussian and restricted‑incoherence embedding models, and confirms predictions with controlled experiments, yielding a principled design rule for allocating key–query budget across heads in self‑attention systems. These results illuminate when and how self‑attention capacity is achieved and offer a concrete, falsifiable framework to guide architecture choices in capacity‑constrained settings.

Abstract

While self-attention is known to learn relations among tokens, we lack a formal understanding of its capacity: how many distinct relations can a single layer reliably recover for a given budget? To formalize this, we introduce Relational Graph Recognition (RGR), where the key-query channel represents a graph on $m$ items with $m'$ directed edges, and, given a context of items, must recover the neighbors of each item. We measure resources by the total key dimension $D_K = h\,d_k$. Within this framework, we analytically derive a capacity scaling law and validate it empirically. We show that $D_K = \Theta(m' \log m' / d_{\text{model}})$ is both necessary (information-theoretic lower bound) and sufficient (explicit construction) in a broad class of graphs to recover $m'$ relations. This scaling law directly leads to a new, capacity-based rationale for multi-head attention that applies even when each item only attends to a single target. When embeddings are uncompressed ($m = d_{\text{model}}$) and the graph is a permutation, a single head suffices. However, compression ($m > d_{\text{model}}$) forces relations into overlapping subspaces, creating interference that a single large head cannot disentangle. Our analysis shows that allocating a fixed $D_K$ across many small heads mitigates this interference, increasing the number of recoverable relations. Controlled single-layer experiments mirror the theory, revealing a sharp performance threshold that matches the predicted capacity scaling and confirms the benefit of distributing $D_K$ across multiple heads. Altogether, these results provide a concrete scaling law for self-attention capacity and a principled design rule for allocating key-query budget across heads.
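To make the scaling law above concrete, the following is a minimal Python sketch, not code from the paper. It evaluates the expression $m' \log m' / d_{\text{model}}$ and lists the ways a fixed budget $D_K = h\,d_k$ can be split across heads. The function names, the constant `c`, and the choice of natural logarithm are assumptions made for illustration; the $\Theta(\cdot)$ statement fixes neither the constant nor the log base, so the output indicates relative scaling only, not an exact threshold.

```python
import math
from typing import List, Tuple


def predicted_key_budget(num_edges: int, d_model: int, c: float = 1.0) -> float:
    """Evaluate c * m' * ln(m') / d_model.

    `c` is a hypothetical constant: the Theta(.) law leaves it unspecified.
    """
    if num_edges < 2 or d_model < 1:
        raise ValueError("need num_edges >= 2 and d_model >= 1")
    return c * num_edges * math.log(num_edges) / d_model


def head_splits(total_key_dim: int) -> List[Tuple[int, int]]:
    """All (h, d_k) pairs with h * d_k == total_key_dim."""
    return [
        (h, total_key_dim // h)
        for h in range(1, total_key_dim + 1)
        if total_key_dim % h == 0
    ]


if __name__ == "__main__":
    # Halving d_model (more compression) doubles the predicted budget.
    for d_model in (128, 64, 32):
        print(d_model, predicted_key_budget(num_edges=1024, d_model=d_model))
    # Candidate allocations of a fixed budget D_K = 12 across heads.
    print(head_splits(12))
```

The sketch only enumerates allocations; which split of $D_K$ works best is what the paper's analysis and experiments address.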

Paper Structure

This paper contains 67 sections, 13 theorems, 74 equations, 8 figures, 2 tables, and 4 algorithms.

Key Result

Lemma 4.1

Fix parameters $\{(W_Q^{(k)},W_K^{(k)})\}_{k=1}^h$ and a threshold $\tau$. Let the aggregated score be $S_{ij}^{\max}:=\max_k S_{ij}^{(k)}$. If, simultaneously for all $i\in V$, then for every context $\mathcal{C}\subseteq V$ and every source $i\in\mathcal{C}$: (i) if $\pi(i)\in\mathcal{C}$ then $S_{i,\pi(i)}^{\max}>\tau$ and $S_{ij}^{\max}<\tau$ for all $j\in\mathcal{C}\!\setminus\!\{\pi(i)\}$;
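The max-aggregated score and threshold test in the lemma can be illustrated with a small pure-Python sketch. It assumes the per-head score is the usual bilinear form $S_{ij}^{(k)} = \langle x_i W_Q^{(k)},\, x_j W_K^{(k)}\rangle$, which the excerpt above does not state explicitly. The helper names, the toy permutation, and the one-hot single-head weights are hypothetical choices for illustration, not the paper's construction.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def project(x: Sequence[float], w: Matrix) -> List[float]:
    """Row vector times matrix: returns x @ w."""
    return [sum(x[r] * w[r][c] for r in range(len(x))) for c in range(len(w[0]))]


def max_score(x_i: Sequence[float], x_j: Sequence[float],
              heads: Sequence[Tuple[Matrix, Matrix]]) -> float:
    """S_ij^max = max over heads of <x_i W_Q, x_j W_K> (assumed score form)."""
    best = float("-inf")
    for w_q, w_k in heads:
        q, k = project(x_i, w_q), project(x_j, w_k)
        best = max(best, sum(a * b for a, b in zip(q, k)))
    return best


def recover_neighbors(embeddings: Matrix, heads: Sequence[Tuple[Matrix, Matrix]],
                      context: Sequence[int], tau: float) -> Dict[int, List[int]]:
    """For each source i in the context, list the j in the context with S_ij^max > tau."""
    return {
        i: [j for j in context
            if max_score(embeddings[i], embeddings[j], heads) > tau]
        for i in context
    }


# Toy setup (hypothetical): m = d_model = 4, one-hot embeddings, one head.
m = 4
pi = {0: 2, 1: 0, 2: 3, 3: 1}  # each item attends to a single target
eye = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
w_k = [[0.0] * m for _ in range(m)]
for i, target in pi.items():
    w_k[target][i] = 1.0  # key of pi(i) aligns with query of i
heads = [(eye, w_k)]

result = recover_neighbors(eye, heads, context=[0, 1, 2], tau=0.5)
print(result)
```

In this toy case the targets of items 0 and 1 lie inside the context and are the only entries above the threshold, while item 2's target (item 3) is outside the context, so nothing is reported for it.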

Figures (8)

  • Figure 1: Example F1–$D_K$ curve. Each line is a fixed number of heads. A single head has significantly worse performance than multiple heads. Error bars are 95% CIs over 10 runs.
  • Figure 2: More heads are needed as $m$ grows and as $d_{\text{model}}$ shrinks. See the appendix on experimental details and results for the error bar description.
  • Figure 3: Comparison of $D_K^\star$ to theoretical scaling law. $x$-axis: scaling law prediction. $y$-axis: observed behavior. See the appendix on experimental details and results for the error bar description.
  • Figure 4: Fixed compression diagonal with $r=8$. Line (left axis): minimum $D_K^\star$ achieving F1 $\ge 0.99$. Bars (right axis): $h$ achieving that minimum. See the appendix on experimental details and results for the error bar description.
  • Figure 5: Minimum total key dimension $D_K^\star$. Upper right and lower left numbers represent confidence range; methodology described in the text.
  • ...and 3 more figures

Theorems & Definitions (24)

  • Lemma 4.1: Context‑robustness
  • proof
  • Theorem 4.2: Correctness of the multi‑head construction under Gaussian unit‑norm embeddings
  • Definition 7.1: Constant‑margin recovery
  • Theorem 7.2: Description‑length lower bound for QK
  • proof
  • Lemma 7.3: Lipschitz property of the Decision Function
  • Theorem B.1: Single-head recognition under one-hot inputs
  • proof
  • Theorem B.2: Multi‑head recognition under Gaussian unit‑norm embeddings
  • ...and 14 more