Memorization in Attention-only Transformers
Léo Dana, Muni Sreenivas Pydi, Yann Chevaleyre
TL;DR
This paper studies how an Attention-only Transformer (AoT) memorizes information for next-token prediction under two tasks: exact association memorization and distribution memorization. The authors formalize the AoT as a sequence encoder and give a constructive proof that, for any $\varepsilon$, an AoT satisfying $H d_h + d \ge T_{\varepsilon}$ can approximate the target distribution with a divergence $d_{KL}(\pi,\mathcal{T})$ close to that of the best sequence encoder, while keeping the parameter count manageable. They identify a fundamental bottleneck for distribution memorization tied to the embedding dimension $d$, and show that the number of exactly memorized associations scales as $T_0 = H d_h + d$ (or $H d_h + 2$ in a low-dimensional instantiation), a result supported by empirical scaling laws. Experiments show that memory scales linearly in the number of heads $H$ and near-quadratically in the head dimension $d_h$, and that AoTs are competitive in memory efficiency with MLP-based Transformers under optimization constraints. Overall, the work clarifies how attention mechanisms contribute to memorization, introduces distribution memorization as a rigorous objective, and provides concrete bounds and scaling laws relevant to both theory and practice.
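The two quantities quoted above can be made concrete with a small Python sketch. It is only an illustration of the arithmetic, not the authors' construction: the function names are hypothetical, the capacity formula is taken as stated in the summary ($T_0 = H d_h + d$), and the KL helper assumes discrete next-token distributions given as plain lists of probabilities.

```python
import math


def exact_association_capacity(num_heads, head_dim, embed_dim):
    """Capacity T_0 = H * d_h + d, as quoted in the summary (hypothetical helper)."""
    return num_heads * head_dim + embed_dim


def meets_capacity(num_heads, head_dim, embed_dim, num_associations):
    """Check the condition H * d_h + d >= T for a given number of associations T."""
    return exact_association_capacity(num_heads, head_dim, embed_dim) >= num_associations


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same vocabulary.

    Assumption: p is the target next-token distribution and q is the model's
    prediction; terms with p_i = 0 contribute nothing, and q_i = 0 where
    p_i > 0 yields an infinite divergence.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Example: 4 heads of dimension 8 with embedding dimension 16.
capacity = exact_association_capacity(num_heads=4, head_dim=8, embed_dim=16)  # 48
```

Under this reading, adding a head increases the capacity by $d_h$, which matches the linear scaling in $H$ reported in the summary. The near-quadratic scaling in $d_h$ is an empirical observation and is not captured by this formula alone.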
Abstract
Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to any context size. Our approach improves upon the state-of-the-art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and provide a precise comparison with prior work.
