Memorization in Attention-only Transformers

Léo Dana, Muni Sreenivas Pydi, Yann Chevaleyre

TL;DR

Memorization in Attention-only Transformers investigates how an Attention-only Transformer (AoT) can memorize information for next-token prediction under two tasks: exact association and distribution memorization. The authors formalize the AoT as a sequence encoder and prove a constructive theorem showing that, for any $\varepsilon$, an AoT with $H d_h + d \ge T_{\varepsilon}$ can approximate the target distribution with $d_{KL}(\pi,\mathcal{T})$ close to the divergence of the best sequence encoder, while keeping a manageable parameter count. They reveal a fundamental bottleneck for distribution memorization tied to the embedding dimension $d$, and demonstrate that exact associative memorization scales as $T_0 = H d_h + d$ (or $H d_h + 2$ in a low-dimensional instantiation), supported by empirical scaling laws. Experiments show memory scaling linearly in the number of heads $H$ and near-quadratically in the head dimension $d_h$, with the AoT exhibiting competitive memory efficiency versus MLP-based Transformers under optimization constraints. Overall, the work clarifies how attention mechanisms contribute to memorization, introduces distribution-based memorization as a rigorous objective, and provides concrete bounds and scaling laws relevant for both theory and practice.
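As a rough illustration of the quantity $T_0 = H d_h + d$ quoted in the TL;DR, the following minimal Python sketch evaluates it for a given configuration and checks the condition $H d_h + d \ge T_{\varepsilon}$. The function names and the example values are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def exact_memorization_bound(num_heads: int, head_dim: int, embed_dim: int) -> int:
    """Return T_0 = H * d_h + d, the expression quoted in the TL;DR.

    num_heads -> H, head_dim -> d_h, embed_dim -> d.
    The function name is illustrative, not from the paper.
    """
    return num_heads * head_dim + embed_dim


def meets_threshold(num_heads: int, head_dim: int, embed_dim: int, t_eps: int) -> bool:
    """Check the condition H * d_h + d >= T_eps for a given threshold T_eps."""
    return exact_memorization_bound(num_heads, head_dim, embed_dim) >= t_eps


if __name__ == "__main__":
    # Example values are arbitrary, picked only to show the arithmetic.
    print(exact_memorization_bound(num_heads=4, head_dim=8, embed_dim=10))
    # Adding one head increases the expression by exactly d_h (linear in H).
    print(exact_memorization_bound(5, 8, 10) - exact_memorization_bound(4, 8, 10))
```

Under this expression, each additional head contributes $d_h$ to the total, which matches the linear dependence on $H$ described above. The near-quadratic dependence on $d_h$ is an empirical observation in the summary and is not captured by this formula.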

Abstract

Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to any context size. Our approach improves upon the state-of-the-art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and provide a precise comparison with prior work.

Paper Structure

This paper contains 21 sections, 12 theorems, 50 equations, 10 figures, 1 table.

Key Result

Proposition 1

Let $\mathcal{T}$ be any Transformer with embedding dimension $d$, dictionary size $N$ and context window $S$, and $\pi$ be any distribution, we have

Figures (10)

  • Figure 1: Scaling laws on $H$ and $d_h$. The embedding dimension is 10 and the dictionary size is 50. The blue dotted lines are the linear or quadratic least-squares fits of the empirical accuracy.
  • Figure 2: Two different measurements of the constant $C(d,N)$. Left: only $d$ varies. Right: $d=d_h$ varies.
  • Figure 3: Experiment 5. Accuracy scaling laws on the number of parameters for two models: an AoT and an MLP-based Transformer. Both of them have the same embedding dimension $d$.
  • Figure 4: Experiment 6. Scaling law for $N=10$, with the smallest possible embedding and head dimensions, $d=d_h=2$. We observe that our lower bound largely underestimates the memorization capacity of a single attention head, but that the scalings are almost identical, with a relative difference in slope of $1.3$.
  • Figure 5: Experiment 1, larger dimension. We plotted the scaling law as in Experiment 1, but with $N=200$ and $d=50$. As expected, the scaling remains linear.
  • ...and 5 more figures
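
The Figure 1 caption refers to linear or quadratic least-squares fits of the empirical accuracy. A minimal sketch of such a fit is given below, assuming ordinary least squares on a single variable; the paper's actual fitting procedure is not shown here, and the data points are synthetic placeholders, not results from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ slope * x + intercept.

    Returns (slope, intercept). Pure-Python closed form, no dependencies.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_quadratic_no_linear_term(xs, ys):
    """Least-squares fit of y ~ a * x**2 + b, done by regressing y on x**2.

    This is one simple choice of quadratic model (an assumption); it omits
    the linear term.
    """
    return fit_line([x * x for x in xs], ys)


if __name__ == "__main__":
    # Synthetic, noise-free data for illustration only.
    heads = [1, 2, 3, 4]
    linear_y = [3 * h + 2 for h in heads]
    print(fit_line(heads, linear_y))

    head_dims = [1, 2, 3, 4]
    quadratic_y = [2 * k * k + 1 for k in head_dims]
    print(fit_quadratic_no_linear_term(head_dims, quadratic_y))
```

Regressing on $x^2$ keeps the quadratic fit a one-variable linear problem, so the same closed form serves both the linear-in-$H$ and quadratic-in-$d_h$ cases.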

Theorems & Definitions (19)

  • Definition 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 1
  • Corollary 1
  • Lemma 1
  • Lemma 2
  • Corollary 2
  • Theorem 2
  • ...and 9 more