Understanding Factual Recall in Transformers via Associative Memories
Eshaan Nichani, Jason D. Lee, Alberto Bietti
TL;DR
This work analyzes how transformers memorize facts using associative memories, showing that shallow transformers can achieve near-optimal factual recall by combining linear or MLP-based associative memories with attention. The authors prove that the storage capacity of these memories scales linearly with parameter count, and that a one-layer transformer can achieve 100% recall on a synthetic task when either the self-attention or the MLP parameter count scales linearly, up to logarithmic factors, with the number of facts SR. They also study gradient dynamics, revealing a sequential learning trajectory with an intermediate hallucination stage in which the model conditions on the relation before it learns to use the subject, and they provide information-theoretic lower bounds that match their constructions up to log factors. Experiments corroborate the theory, showing that memory capacity scales linearly with model size and that storage can be traded off between the attention and MLP components. Overall, the paper advances understanding of memorization and factual recall in transformers and suggests architectural principles for scalable memory in neural networks.
Abstract
Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near-optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task, and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or the total number of MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices or the MLP as an associative memory to store the dataset of facts. We complement these expressivity results with an analysis of the gradient flow trajectory of a simplified linear attention model trained on our factual recall task, where we show that the model exhibits sequential learning behavior.
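To make the notion of a linear associative memory concrete, the following is a minimal sketch of the classical outer-product construction. It is not taken from the paper. The random unit-norm embeddings, the dimensions, and the helper names `build_linear_memory` and `recall` are all illustrative assumptions, and the number of stored associations is kept far below the d^2 parameter budget so that recall is easy.

```python
import numpy as np


def build_linear_memory(inputs, outputs, mapping):
    """Store the associations i -> mapping[i] in one d x d matrix.

    Computes W = sum_i u_{mapping[i]} e_i^T, where e_i is row i of `inputs`
    and u_j is row j of `outputs` (hypothetical helper, for illustration).
    """
    return outputs[mapping].T @ inputs


def recall(W, inputs, outputs):
    """Decode each input by taking argmax_j of u_j^T W e_i."""
    scores = inputs @ W.T @ outputs.T  # scores[i, j] = u_j^T W e_i
    return scores.argmax(axis=1)


# Assumed setup: random unit-norm embeddings and a random target map f.
rng = np.random.default_rng(0)
d, num_inputs, num_outputs = 512, 64, 64

E = rng.standard_normal((num_inputs, d))
E /= np.linalg.norm(E, axis=1, keepdims=True)
U = rng.standard_normal((num_outputs, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
f = rng.integers(0, num_outputs, size=num_inputs)

W = build_linear_memory(E, U, f)
accuracy = float((recall(W, E, U) == f).mean())
print(f"stored {num_inputs} associations in {W.size} parameters, "
      f"recall accuracy = {accuracy:.2f}")
```

Because random high-dimensional embeddings are nearly orthogonal, the cross-talk between stored pairs stays small compared with the signal term, so the argmax decoding recovers each target in this lightly loaded regime. The paper's capacity results concern how far the number of associations can grow relative to the parameter count, which this sketch does not probe.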
