$\text{Memory}^3$: Language Modeling with Explicit Memory

Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, Weinan E

TL;DR

Memory^3 addresses the high cost of training and running large language models by externalizing knowledge into explicit memory, forming a memory hierarchy in which formats that are expensive to write are cheap to read, and vice versa. The approach introduces a memory circuitry theory and a sparse, externally stored memory bank, coupled with a two-stage pretraining regime that biases learning toward abstract knowledge while externalizing specific facts. A 2.4B Memory^3 model outperforms larger models and decodes faster than retrieval-augmented generation, aided by dense/sparse memory mechanisms, FAISS-based retrieval, and a vector-quantized memory store. Across general benchmarks and professional tasks, Memory^3 shows competitive or superior results, improved factuality, and notable speedups, suggesting practical pathways to cheaper, scalable LLMs with infinite-context capabilities and easier task adaptation. The work lays groundwork for further refinements in memory formats, dynamic memory management, and end-to-end systems leveraging explicit memories for real-time deployment.

Abstract

The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining "abstract knowledge". As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named $\text{Memory}^3$, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.

Paper Structure (41 sections, 20 equations, 13 figures, 11 tables)

Figures (13)

  • Figure 1: The Memory$^3$ model converts texts to explicit memories, and then recalls these memories during inference. The explicit memories can be seen as retrievable model parameters, externalized knowledge, or sparsely-activated neural circuits.
  • Figure 2: Left: Performance on benchmarks, with respect to model size (top-left is better). Right: Retrieval-augmented performance on professional tasks, versus decoding speed with retrieval (top-right is better). The left plot is based on Table \ref{tab:basic-results}. The right plot is based on Tables \ref{tab:domain-results} and \ref{tab:throughput}. Memory$^3$ uses high-frequency retrieval of explicit memories, while the RAG models use a fixed number of 5 references. This is a preliminary experiment and we have optimized neither the quality of our pretraining data nor the efficiency of our inference pipeline, so the results may not be comparable to those of the SOTA models.
  • Figure 3: The total cost (TFlops) of writing and reading a piece of knowledge by our 2.4B model with respect to its expected usage count. The curves represent the cost of different memory formats, and the shaded area represents the minimum cost given the optimal format. The plot indicates that $(0.494, 13400)$ is the advantage interval for explicit memory. The calculations are provided in Appendix \ref{appendix:cost}. (The blue curve is only a lower bound on the cost of model parameters.)
  • Figure 4: Categorization of knowledge and memory formats. The explicit memories, extracted from model activations, lie half-way between raw data and model parameters, so we use a dotted line to indicate that they may or may not be regarded as parameters.
  • Figure 5: Illustration of three subgraphs. Left: A subgraph that inputs "the capital of China is" and outputs "Beijing". The knowledge neuron is marked in red and the mover heads in green. Middle: Another subgraph with similar function using task-specific heads. Right: The induction-heads subgraph that inputs "[a][b]...[a]" and outputs [b], where [a], [b] are arbitrary tokens. The notations are introduced in Section \ref{sec:knowledge}. The locations of these attention heads and MLP neurons may be variable.
  • ...and 8 more figures
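
The "advantage interval" in the Figure 3 caption can be read as the range of usage counts over which explicit memory has the lowest total cost. The following is a minimal sketch of that reasoning, assuming a linear cost model (one-time write cost plus per-use read cost times usage count). The class name, function names, and all numeric values are hypothetical placeholders chosen for illustration; they are not the paper's measured costs, which are given in its cost appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryFormat:
    """A knowledge storage format under an assumed linear cost model."""
    name: str
    write_cost: float  # one-time cost to store the knowledge (arbitrary units)
    read_cost: float   # cost paid each time the knowledge is used

    def total_cost(self, uses: float) -> float:
        # Assumption: total cost grows linearly with the expected usage count.
        return self.write_cost + uses * self.read_cost


def break_even(a: MemoryFormat, b: MemoryFormat) -> float:
    """Usage count at which formats a and b have equal total cost."""
    return (b.write_cost - a.write_cost) / (a.read_cost - b.read_cost)


def cheapest(formats, uses: float) -> MemoryFormat:
    """Format with the lowest total cost at the given usage count."""
    return min(formats, key=lambda f: f.total_cost(uses))


# Placeholder numbers (NOT from the paper): cheap-to-write formats are
# expensive to read, and expensive-to-write formats are cheap to read.
plain_text = MemoryFormat("plain text (RAG)", write_cost=0.0, read_cost=4.0)
explicit = MemoryFormat("explicit memory", write_cost=2.0, read_cost=1.0)
parameters = MemoryFormat("model parameters", write_cost=1000.0, read_cost=0.0)
formats = [plain_text, explicit, parameters]

# With these placeholders, explicit memory is cheapest between the two
# break-even points.
lower = break_even(plain_text, explicit)
upper = break_even(explicit, parameters)
```

Under this model, knowledge used less often than `lower` is cheapest to keep as plain text, knowledge used more often than `upper` is cheapest to train into parameters, and explicit memory wins in between. The caption's interval $(0.494, 13400)$ plays the role of `(lower, upper)` for the paper's actual costs.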

Theorems & Definitions (23)

  • Example 1
  • Example 2
  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Example 3
  • Example 4
  • Definition 5
  • Definition 6
  • ...and 13 more