
Memorization Capacity of Multi-Head Attention in Transformers

Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis

TL;DR

This work analyzes the memorization capacity of a single-layer multi-head attention module in Transformers under relaxed linear-independence assumptions, rather than the traditional general-position assumption. It proves that a layer with $H$ heads, embedding dimension $d$, head dimension $d_h$, value dimension $d_v$, and context size $n<d$ can memorize $\Omega\left(H\min(n,d_h)\right)$ examples using $\Theta(Hd(d_h+d_v))$ parameters, provided the query set has Kruskal rank at least $n$ and each context matrix has rank $n$. The proof constructs a high-rank intermediate representation via the attention maps and then solves a linear system to fix the readout layers, with softmax saturation playing a key role in allocating memorization across heads. Empirical validation on synthetic data and ViT-based experiments shows that these assumptions hold in practice in many settings, and confirms the predicted linear growth in $H$ and that increasing $d_h$ improves memorization only while $d_h \le n$. The results offer new theoretical insight into how attention layers distribute memorization across heads and contexts, with implications for privacy, interpretability, and efficient Transformer design.

Abstract

Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with $H$ heads, dimension $d$, and context size $n < d$, featuring $\Theta(Hd^2)$ parameters, can memorize $\Omega(Hn)$ examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property. We validate our findings through experiments on synthetic data.
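
The layer and its parameter count can be made concrete with a minimal NumPy sketch. The parameterization below is an assumption chosen to match the dimensions named in the summary (per head, key and query maps of shape $d \times d_h$, a value map of shape $d \times d_v$, and a readout of shape $d_v \times d_\text{out}$); the paper's exact notation may differ, and the helper names `init_heads`, `attention_layer`, and `num_parameters` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def init_heads(H, d, d_h, d_v, d_out, seed=0):
    """Random weights for H heads (assumed shapes, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    return [
        {
            "W_K": rng.standard_normal((d, d_h)),     # key map
            "W_Q": rng.standard_normal((d, d_h)),     # query map
            "W_V": rng.standard_normal((d, d_v)),     # value map
            "W_O": rng.standard_normal((d_v, d_out)),  # readout
        }
        for _ in range(H)
    ]


def attention_layer(heads, E, e):
    """E: (n, d) context matrix, e: (d,) query token. Returns a (d_out,) vector."""
    out = 0.0
    for h in heads:
        scores = (E @ h["W_K"]) @ (h["W_Q"].T @ e)  # (n,) attention logits
        weights = softmax(scores)                   # (n,) attention map
        out = out + h["W_O"].T @ (h["W_V"].T @ (E.T @ weights))
    return out


def num_parameters(heads):
    """Total number of scalar weights across all heads."""
    return sum(W.size for h in heads for W in h.values())


if __name__ == "__main__":
    heads = init_heads(H=4, d=16, d_h=8, d_v=8, d_out=1)
    # H * (2*d*d_h + d*d_v + d_v*d_out)
    print(num_parameters(heads))
```

Under these assumed shapes the count is $H(2dd_h + dd_v + d_vd_\text{out})$, which is $\Theta(Hd(d_h+d_v))$ whenever $d_\text{out} \le d$, and reduces to $\Theta(Hd^2)$ when $d_h$ and $d_v$ are proportional to $d$.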

Paper Structure (25 sections, 11 theorems, 72 equations, 5 figures, 3 tables)


Key Result

Theorem 1

Consider a multi-head attention layer $\mathcal{A}$ with $H$ heads, embedding dimensions $d$, ${d_v}\geq {d_\text{out}} \geq 1$, and $d_h \geq 1$. Let $\mathcal{T} = \left\{\left(\bm{E}^{(t)}, {\bm{e}}^{(t)}, \bm{y}^{(t)} \right)\right\}_{t=1}^{T}$ be a training set with context size $n < d$. Define

Figures (5)

  • Figure 1: Testing the Kruskal rank of query tokens on the output of a one-layer random attention model on ImageNet. The Kruskal rank is only slightly larger than $n$ (Assumption \ref{asp:q_gen_pos}), and much smaller than $d$ (general position).
  • Figure 2: Testing memorization as a function of the number of heads (a), context size (b), and head size (c) for classification under Assumptions \ref{asp:c_lin_indep} and \ref{asp:q_gen_pos}. Examples are generated synthetically with $d=64$ and a shared context (see Proposition \ref{prop:rank_upperbound_same_ctx}). Memorization increases linearly with $H$, monotonically with $n$, and monotonically with $d_h$ as long as $d_h \leq n$.
  • Figure 3: Testing the saturation property of softmax on synthetic data. The top row uses $H=8$ and the bottom row $H=4$. (a) and (d) show whether the softmax in each head is saturated for the first $20$ examples of the dataset; results are qualitatively similar for the remaining examples. (b) and (e) show the histogram of the number of saturated heads for each example across the dataset ($H \times T$ in total). (c) and (f) show the histogram of the maximum softmax coefficient for each head and example across all the datasets ($H \times T$ in total). All panels suggest that, with perfect memorization, the majority of heads become saturated across the examples, with almost all examples having at least one non-saturated head.
  • Figure 4: Testing memorization under the general-position assumption for both context and query vectors. While linearity in $H$ remains, linearity in $n$ no longer strictly holds, suggesting that $H(d-1)+1$ is a better alternative under the general-position assumption.
  • Figure 5: Testing memorization on a regression task in a setting similar to Figure 2, except with real-valued labels. The same results and conclusions in terms of monotonicity with $H$ and $n$ hold for regression. The variance in the regression plots is magnified in some cases due to the log scale of the plot.
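
The saturation property referred to in Figure 3 can be illustrated with a small sketch: as the attention logits grow in scale, the softmax output approaches a one-hot vector, so the head effectively attends to a single context token. The `threshold` of 0.99 and the helper name `is_saturated` are hypothetical choices for illustration only; the source does not state which criterion the paper uses to call a head saturated.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def is_saturated(weights, threshold=0.99):
    """Assumed criterion: the largest attention weight is at least `threshold`."""
    return float(np.max(weights)) >= threshold


# Fixed logits; only their overall scale changes.
scores = np.array([1.0, 0.5, -0.3, 0.2])

for scale in (1.0, 10.0, 100.0):
    w = softmax(scale * scores)
    # The maximum coefficient moves toward 1 as the scale grows.
    print(f"scale={scale:6.1f}  max weight={w.max():.4f}  saturated={is_saturated(w)}")
```

The maximum softmax coefficient computed here is the same kind of quantity that panels (c) and (f) of Figure 3 histogram per head and example.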

Theorems & Definitions (22)

  • Definition 1: Kruskal Rank
  • Definition 2: General Position
  • Theorem 1
  • Proposition 1
  • Remark 1
  • Proposition 2
  • Proposition 3
  • Remark 2
  • Proposition 4
  • Claim 1: Fit $r$ examples in one head
  • ...and 12 more
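
Definition 1 can be made concrete with a brute-force check, using the standard definition of Kruskal rank: the largest $k$ such that every subset of $k$ vectors is linearly independent. The sketch below is only an illustration, not the procedure used for Figure 1; the function name `kruskal_rank` and the tolerance `tol` are assumptions, and the cost is exponential in the number of vectors, so it is practical only for small sets.

```python
from itertools import combinations

import numpy as np


def kruskal_rank(vectors, tol=1e-10):
    """Largest k such that every k-subset of the rows is linearly independent."""
    X = np.asarray(vectors, dtype=float)  # one vector per row
    m = X.shape[0]
    k_max = min(X.shape)  # no subset larger than the dimension can be independent
    for k in range(1, k_max + 1):
        for idx in combinations(range(m), k):
            if np.linalg.matrix_rank(X[list(idx)], tol=tol) < k:
                return k - 1  # found a dependent k-subset
    return k_max


if __name__ == "__main__":
    # Every pair is independent, but the first three rows are dependent,
    # so the Kruskal rank is 2 even though the matrix rank is 3.
    X = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
    print(kruskal_rank(X), np.linalg.matrix_rank(np.array(X, dtype=float)))
```

The example shows why Kruskal rank is a stricter notion than matrix rank: a single dependent subset lowers it, which is why a requirement of Kruskal rank at least $n$ with $n < d$ is weaker than asking for the maximal value $d$ that Figure 1 associates with general position.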