Constructing Efficient Fact-Storing MLPs for Transformers

Owen Dugan, Roberto Garcia, Ronny Junkins, Jerry Liu, Dylan Zinsley, Sabri Eyuboglu, Atri Rudra, Chris Ré

TL;DR

The paper advances a constructive framework for storing factual knowledge in MLPs embedded in Transformers, addressing limitations of prior constructions by handling broad input/output geometries, achieving information-theoretically near-optimal parameter efficiency, and enabling direct use for factual recall. It introduces a decodability metric ρ based on value-embedding geometry, and an encoder–decoder architecture with both gradient-based and closed-form variants, complemented by embedding whitening to boost capacity. The authors demonstrate that their MLPs are usable inside Transformer blocks and reveal a capacity–usability tradeoff, with Lipschitz-constant metrics predicting recall usability. As a practical demonstration, they implement modular fact editing in a 1-layer transformer by swapping entire MLPs, achieving substantial recall accuracy with minimal disruption to non-fact tokens. Overall, the work provides a principled, scalable path toward interpretable, parameter-efficient knowledge storage and manipulation in LLMs.

Abstract

The success of large language models (LLMs) can be attributed in part to their ability to efficiently store factual knowledge as key-value mappings within their MLP parameters. Recent work has proposed explicit weight constructions to build such fact-storing MLPs, providing an improved understanding of LLM fact storage mechanisms. In this paper, we introduce an MLP construction framework that improves over previous constructions in three areas: it 1) works for all but a measure-zero set of feasible input-output pairs, 2) achieves asymptotically optimal parameter efficiency matching information-theoretic bounds for some embeddings, and 3) maintains usability within Transformers for factual recall. Through our improvements, we 1) discover a metric on value embeddings that characterizes facts-per-parameter scaling for both constructed and gradient-descent-trained MLPs, 2) identify a simple encoder-decoder mechanism that empirically matches gradient-descent MLP facts-per-parameter asymptotics across all the inputs and outputs we test, and 3) uncover a fundamental tradeoff between an MLP's fact-storage capacity and its usability within Transformers. Finally, we demonstrate a proof-of-concept application of fact-storing MLPs: modular fact editing on one-layer Transformers by replacing entire MLPs at once.
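
To make the key-value view of fact storage concrete, the following is a minimal sketch of how one might check whether a map stores a fact set. It rests on assumptions that are not stated in this summary: a fact counts as stored when the output for a key has its largest dot product with the target value embedding, and the `linear_memory` stand-in is a plain linear associative memory over orthonormal keys, not the paper's construction. The names `recall_accuracy`, `linear_memory`, and `fact_map` are hypothetical.

```python
import numpy as np

def recall_accuracy(mlp, keys, values, fact_map):
    """Fraction of keys whose output decodes to the intended value.

    Assumed decoding rule: key i is recalled correctly when the output for
    keys[i] scores highest (by dot product) against values[fact_map[i]].
    """
    outputs = mlp(keys)                  # shape (num_keys, d)
    scores = outputs @ values.T          # shape (num_keys, num_values)
    predicted = scores.argmax(axis=1)    # decoded value index per key
    return float((predicted == fact_map).mean())

rng = np.random.default_rng(0)
d, num_keys, num_values = 16, 8, 8

# Orthonormal keys: the first num_keys standard basis vectors of R^d.
keys = np.eye(num_keys, d)

# Unit-norm (spherical) value embeddings, so each value is its own best match.
values = rng.standard_normal((num_values, d))
values /= np.linalg.norm(values, axis=1, keepdims=True)

# A random fact set: key i maps to value fact_map[i].
fact_map = rng.integers(0, num_values, size=num_keys)

# Hypothetical stand-in for a fact-storing map: W = sum_i v_{f(i)} k_i^T.
W = values[fact_map].T @ keys            # shape (d, d)

def linear_memory(x):
    return x @ W.T

accuracy = recall_accuracy(linear_memory, keys, values, fact_map)
print(f"recall accuracy: {accuracy:.2f}")
```

Because the keys are orthonormal, the stand-in returns each target value embedding exactly, so every fact is recalled. The same `recall_accuracy` check could be applied to any callable that maps key embeddings to output vectors.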

Paper Structure

This paper contains 124 sections, 48 theorems, 392 equations, 9 figures, 1 table, 7 algorithms.

Key Result

Proposition 2.1.1

Assuming a constant number of bits per parameter, the fact-storage cost of embeddings $\mathbf{K}$ and $\mathbf{V}$ for any model family $\mathbf{g}$ satisfies $W(\mathbf{g}; \mathbf{K}, \mathbf{V}) = \Omega(|\mathbf{K}|\log [|\mathbf{V}|])$.
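
A bound of this form can be motivated by a standard counting argument, sketched here under the assumption that the model family must be able to represent every map from keys to values: there are $|\mathbf{V}|^{|\mathbf{K}|}$ such maps, so distinguishing them takes at least $|\mathbf{K}| \log_2 |\mathbf{V}|$ bits, and dividing by the bits per parameter gives a parameter count. The function name `min_parameters` and the example sizes are illustrative and are not taken from the paper, whose proof may proceed differently.

```python
import math

def min_parameters(num_keys: int, num_values: int, bits_per_param: float) -> float:
    """Counting lower bound on parameters needed to store an arbitrary fact set.

    There are num_values ** num_keys possible key-to-value maps, so any
    encoding needs at least num_keys * log2(num_values) bits in total.
    """
    total_bits = num_keys * math.log2(num_values)
    return total_bits / bits_per_param

# Illustrative sizes: 1024 keys, 1024 values, 16-bit parameters.
print(min_parameters(1024, 1024, 16))  # 1024 * 10 / 16 = 640.0
```

The bound grows linearly in the number of keys but only logarithmically in the number of values, which matches the $\Omega(|\mathbf{K}|\log |\mathbf{V}|)$ shape of the proposition.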

Figures (9)

  • Figure 1: (Left) Top: We formalize factual knowledge as discrete maps between key and value embeddings. Bottom: Our construction consists of an encoder MLP that exactly maps keys to compressed intermediate values, and a decoder linear layer that linearly decompresses the intermediate values. (Center) We compare how the number of parameters ($y$-axis) needed to represent a fact set scales with the number of facts ($x$-axis). Our construction matches gradient-descent trained (GD) MLP asymptotics and requires $5$--$150\times$ fewer parameters than prior constructions. (Right) We compare how the number of parameters ($y$-axis) needed for an MLP to represent a fact set in a way that is usable within a transformer scales with the number of facts ($x$-axis). Our constructed MLPs exhibit similar asymptotic scaling to GD MLPs, unlike NTK MLPs. Note: NTK refers to the construction from Nichani et al. (2024).
  • Figure 2: (a) For both GD and our constructed MLPs, $\rho$ is predictive ($R^2 > 0.97$) of MLP size for a fixed number of facts. Embedding whitening reduces our constructed MLPs' fact-storage cost by up to $32\times$ and allows NTK MLPs to generalize to highly anisotropic embeddings. (b) GD MLPs and our constructed MLPs exhibit consistent facts-per-parameter scaling as embedding dimension and number of facts vary jointly, whereas NTK MLPs exhibit asymptotically worse scaling as more facts are squeezed into a fixed embedding dimension (pictured for spherical embeddings). Our constructed MLPs have $5$--$150\times$ lower fact-storage cost than NTK MLPs, while GD MLPs have $\sim\!20\times$ lower fact-storage cost than ours. (c) When training the encoder and decoder with gradient descent, the fact-storage cost gap to GD MLPs narrows from $\sim\!20\times$ to $\sim\!4\times$.
  • Figure 3: (a) MLP size vs. fact-set size for MLPs with $\ge 99\%$ usability within a Transformer. We find that fact-storing MLPs are usable within 1-layer Transformers and that our constructed MLPs and GD MLPs exhibit similar $\ge 99\%$ usability scaling. (b) MLP usability within a Transformer vs. MLP storage capacity. We observe a tradeoff between MLP usability within a Transformer and the MLP's fact-storage capacity. (c) MLP usability within a Transformer vs. its Lipschitz constant. We observe that the measured Lipschitz constant is predictive of an MLP's usability within Transformers.
  • Figure 4: Fact-editing score as the number of altered facts increases. Fact editing via MLP swapping outperforms prior weight updates as the number of altered facts increases. The fact-editing score is computed as the geometric mean of the efficacy, specificity, and paraphrase accuracies.
  • Figure 5: NTK MLPs fail to achieve perfect fact storage for sufficiently anisotropic output embeddings. Using the margin-optimal output embeddings for the NTK construction improves fact-storage capacity by up to $4\times$, but does not improve robustness to anisotropic embeddings.
  • ...and 4 more figures
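
The fact-editing score described in the Figure 4 caption can be sketched as follows. This is only an illustration of the geometric mean of three accuracies; the function name `fact_editing_score` and the example values are hypothetical and are not results from the paper.

```python
def fact_editing_score(efficacy: float, specificity: float, paraphrase: float) -> float:
    """Geometric mean of the three accuracies, each assumed to lie in [0, 1]."""
    return (efficacy * specificity * paraphrase) ** (1.0 / 3.0)

# Made-up accuracies, for illustration only.
score = fact_editing_score(0.9, 0.8, 0.7)
print(f"fact-editing score: {score:.3f}")
```

Unlike an arithmetic mean, the geometric mean drops to zero if any one of the three accuracies is zero, so an edit cannot score well by sacrificing one criterion entirely.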

Theorems & Definitions (119)

  • Proposition 2.1.1
  • Definition 3.1.1
  • Definition 3.2.1
  • Proof
  • Lemma 4.1.1
  • Lemma 4.1.2
  • Proof Sketch
  • Theorem 4.3.1: Full Construction
  • Theorem B.1.1
  • Proof
  • ...and 109 more