Paying Attention to Facts: Quantifying the Knowledge Capacity of Attention Layers
Liang Ze Wong
TL;DR
The paper studies how a single-layer attention-only transformer can memorize facts from a database, using a tensor-based, linear-algebraic framework. It defines a database tensor $D$ and uses its rank as a measure of database size, then defines an attention-layer tensor $L$ whose rank is bounded in terms of $d_{model}$, $n_{heads}$, and $d_{head,vo}$, while $d_{head,qk}$ plays a lesser role. It also examines how argmax and softmax can distort rank and proposes a $\mathrm{softmax}_{\geq \tau}$ variant to relate memorization to rank. Experiments on toy models and random databases support these ideas and suggest a way to increase layer capacity without adding parameters. Overall, the work offers a theoretical lens on factual recall in transformers and points to architectural choices that may increase knowledge capacity, with implications for interpretability and for reducing hallucinations.
Abstract
In this paper, we investigate the ability of single-layer attention-only transformers (i.e., attention layers) to memorize facts contained in databases from a linear-algebraic perspective. We associate with each database a 3-tensor, propose the rank of this tensor as a measure of the size of the database, and provide bounds on the rank in terms of properties of the database. We also define a 3-tensor corresponding to an attention layer, and empirically demonstrate the relationship between its rank and database rank on a dataset of toy models and random databases. By highlighting the roles played by the value-output and query-key weights, and the effects of argmax and softmax on rank, our results shed light on the "additive motif" of factual recall in transformers, while also suggesting a way of increasing layer capacity without increasing the number of parameters.
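To make the idea of a database 3-tensor and its rank more concrete, the following is a minimal sketch under stated assumptions. It assumes facts are (subject, relation, object) index triples and that the tensor has a 1 at position [s, r, o] for each stored fact; this layout and the names `database_tensor` and `rank_bounds` are illustrative choices, not necessarily the paper's exact construction. Since exact tensor rank is hard to compute in general, the sketch reports only generic linear-algebra bounds: the largest rank of a mode unfolding as a lower bound, and the smaller of the nonzero count and the pairwise dimension products as an upper bound. These are not the database-specific bounds derived in the paper.

```python
import numpy as np


def database_tensor(facts, n_subjects, n_relations, n_objects):
    """Build a 0/1 tensor with D[s, r, o] = 1 iff (s, r, o) is a stored fact.

    Assumption: this indexing is an illustrative layout, not the paper's definition.
    """
    D = np.zeros((n_subjects, n_relations, n_objects))
    for s, r, o in facts:
        D[s, r, o] = 1.0
    return D


def rank_bounds(D):
    """Return generic (lower, upper) bounds on the tensor rank of a 3-tensor."""
    # Lower bound: the matrix rank of each mode unfolding never exceeds tensor rank.
    unfolding_ranks = [
        int(np.linalg.matrix_rank(np.moveaxis(D, mode, 0).reshape(D.shape[mode], -1)))
        for mode in range(3)
    ]
    lower = max(unfolding_ranks)
    # Upper bound: each nonzero entry is one rank-1 term, and an I x J x K tensor
    # has rank at most min(I*J, I*K, J*K).
    i, j, k = D.shape
    upper = min(int(np.count_nonzero(D)), i * j, i * k, j * k)
    return lower, upper


# Hypothetical toy database: 3 subjects, 2 relations, 2 objects, 4 facts.
facts = [(0, 0, 0), (1, 0, 1), (2, 1, 0), (0, 1, 1)]
D = database_tensor(facts, n_subjects=3, n_relations=2, n_objects=2)
lower, upper = rank_bounds(D)
print(f"tensor rank of D lies between {lower} and {upper}")
```

When the two bounds coincide, they pin down the rank exactly; otherwise they only bracket it, which is one reason database-specific bounds are of interest.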
