Memorization Capacity of Multi-Head Attention in Transformers
Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis
TL;DR
This work analyzes the memorization capacity of a single-layer multi-head attention module in Transformers under relaxed linear-independence assumptions, rather than the traditional general-position assumption. It proves that a layer with $H$ heads, embedding dimension $d$, head dimension $d_h$, value dimension $d_v$, and context size $n<d$ can memorize $\Omega\left(H\min(n,d_h)\right)$ examples using $\Theta(Hd(d_h+d_v))$ parameters, provided the set of query vectors has Kruskal rank at least $n$ and each context matrix has rank $n$. The proof constructs a high-rank intermediate representation from the attention maps and then solves a linear system to fix the readout layers; softmax saturation plays a key role in allocating memorized examples across heads. Experiments on synthetic data and on ViT-based models show that these assumptions hold in many practical settings, and confirm both the predicted linear growth in $H$ and that increasing $d_h$ yields memorization gains only while $d_h\le n$. The results offer new theoretical insight into how attention layers distribute memorization across heads and contexts, with implications for privacy, interpretability, and efficient Transformer design.
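To make the quantities in the summary above concrete, here is a minimal NumPy sketch of a single-layer multi-head attention module applied to one query token. It is an illustration under stated assumptions, not the paper's exact model: the per-head weight names (`W_Q`, `W_K`, `W_V`, `W_O`), the output dimension being equal to $d$, and the omission of any score scaling are all choices made here. The helper names `init_heads`, `mha_output`, `param_count`, and `capacity_scale` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def init_heads(H, d, d_h, d_v, rng):
    # Assumed parameterization: each head has query/key maps (d x d_h),
    # a value map (d x d_v), and an output map (d_v x d).
    return [
        {
            "W_Q": rng.standard_normal((d, d_h)),
            "W_K": rng.standard_normal((d, d_h)),
            "W_V": rng.standard_normal((d, d_v)),
            "W_O": rng.standard_normal((d_v, d)),
        }
        for _ in range(H)
    ]

def mha_output(E, e, heads):
    # E: (n, d) context matrix, e: (d,) query token. Returns a (d,) vector.
    out = np.zeros(heads[0]["W_O"].shape[1])
    for h in heads:
        scores = (E @ h["W_K"]) @ (h["W_Q"].T @ e)  # (n,) attention scores
        attn = softmax(scores)                       # (n,) attention weights
        out += (attn @ E) @ h["W_V"] @ h["W_O"]      # this head's contribution
    return out

def param_count(heads):
    # Total number of scalar parameters across all heads.
    return sum(w.size for h in heads for w in h.values())

def capacity_scale(H, n, d_h):
    # The quantity inside the Omega(.) bound; constants are not tracked.
    return H * min(n, d_h)
```

Under this assumed parameterization each head holds $2d\,d_h + 2d\,d_v$ parameters, so `param_count` returns $2Hd(d_h+d_v)$, which matches the stated $\Theta(Hd(d_h+d_v))$ scaling. `capacity_scale` only reports how the lower bound scales and should not be read as an exact count of memorizable examples.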
Abstract
Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with $H$ heads, dimension $d$, and context size $n < d$, featuring $\Theta(Hd^2)$ parameters, can memorize $\Omega(Hn)$ examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property. We validate our findings through experiments on synthetic data.
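The two linear-independence conditions (query vectors with Kruskal rank at least $n$, and context matrices of rank $n$) can be checked directly on small synthetic inputs. The sketch below is only an illustration: it uses the standard definition of Kruskal rank (the largest $k$ such that every subset of $k$ vectors is linearly independent) and a brute-force search that is practical only for small sets. The function name `kruskal_rank` and the Gaussian test data are assumptions made here, not the paper's experimental setup.

```python
from itertools import combinations

import numpy as np

def kruskal_rank(vectors):
    # vectors: (m, d) array, one vector per row.
    # Returns the largest k such that every subset of k rows is
    # linearly independent. Brute force: exponential in m.
    m, d = vectors.shape
    k = 0
    for size in range(1, min(m, d) + 1):
        subsets = combinations(range(m), size)
        if all(np.linalg.matrix_rank(vectors[list(idx)]) == size
               for idx in subsets):
            k = size
        else:
            break
    return k

# Hypothetical synthetic setting with context size n < d.
rng = np.random.default_rng(0)
n, d, num_queries = 4, 8, 6
queries = rng.standard_normal((num_queries, d))  # one query vector per example
context = rng.standard_normal((n, d))            # one context matrix

queries_ok = kruskal_rank(queries) >= n
context_ok = np.linalg.matrix_rank(context) == n
```

Generic data such as Gaussian samples satisfies both conditions with probability one, whereas repeated or collinear vectors violate them. Whether real inputs satisfy them is an empirical question that depends on the dataset and embedding.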
