LM2: Large Memory Models

Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis

TL;DR

LM2 addresses the challenge of long-context reasoning in Transformers by adding an explicit memory bank to the decoder, which the model reads through cross attention and updates through learned gates. The approach preserves the original Transformer information flow while adding a dynamic memory pathway, enabling robust multi-hop and numerical reasoning across contexts up to 128K tokens. Empirical results show LM2 significantly outperforms prior memory-augmented models on BABILong and improves general task performance on MMLU, while analysis highlights interpretable memory representations and adaptive test-time memory behavior. The work underscores the value of explicit long-term memory in enhancing Transformer capabilities and lays groundwork for future memory-integrated large language models.

Abstract

This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross attention and updating through gating mechanisms. To preserve the Transformer's general-purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in multi-hop inference, numerical reasoning, and large-context question-answering. On the MMLU dataset, it achieves a 5.0% improvement over a pre-trained vanilla model, demonstrating that its memory module does not degrade performance on general tasks. Further, in our analysis, we explore memory interpretability, the effectiveness of memory modules, and test-time behavior. Our findings emphasize the importance of explicit memory in enhancing Transformer architectures.
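
The read side of the memory pathway described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`memory_read`, `block_output`), the use of identity projections in place of learned query/key/value matrices, and the toy two-dimensional vectors are all assumptions made for brevity. The sketch shows tokens attending over memory slots through cross attention, and the result being added to the unchanged main attention output so that the original information flow is preserved.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def memory_read(tokens, memory):
    """Cross attention: each token is a query over the memory slots.

    Assumption: identity projections stand in for learned W_Q, W_K, W_V.
    Returns one memory readout vector per token.
    """
    d = len(tokens[0])
    readouts = []
    for q in tokens:
        scores = [dot(q, slot) / math.sqrt(d) for slot in memory]
        weights = softmax(scores)
        readouts.append(
            [sum(w * slot[j] for w, slot in zip(weights, memory)) for j in range(d)]
        )
    return readouts


def block_output(attn_out, mem_out):
    """Add the memory pathway to the untouched main attention pathway."""
    return [[a + m for a, m in zip(a_row, m_row)]
            for a_row, m_row in zip(attn_out, mem_out)]


# Toy data (hypothetical): two tokens, three memory slots, dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

mem_out = memory_read(tokens, memory)
combined = block_output(tokens, mem_out)
```

With an all-zero memory bank the readout is zero and `block_output` returns the main pathway unchanged, which is the sense in which the memory pathway is complementary in this sketch. The gated update of the memory itself is not modeled here.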

Paper Structure

This paper contains 25 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of LM2 overall architecture. It consists of a separate memory bank, which updates the main information flow through cross attention, and is updated using the input ($\mathcal{I}$), output ($\mathcal{O}$), and forget ($\mathcal{F}$) gates. For the information flow from one block to another, the gray curve shows the normal attention flow and the pink curve shows the extra memory flow.
  • Figure 2: Illustration of how the memory module works inside each decoding block, where the blue, green, and red boxes correspond to the forget, input, and output phases.
  • Figure 3: Performance on BABILong benchmark with different capabilities.
  • Figure 4: We sample a question from MMLU to test the LM2 in a few-shot fashion. To study how the memory module focuses on relevant information, we place useful information inside one of the few-shot examples.
  • Figure 5: We evaluate variations of integrating memory within the decoder blocks. The number indicates how many of the initial decoder blocks include the memory module, as we found that the order of implementing memory modules does not affect performance.
  • ...and 1 more figure
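
The three phases named in the captions of Figures 1 and 2 can be written in one plausible form as follows. This is a sketch under stated assumptions, not a transcription of the paper's six equations: sigmoid gates $\sigma$, the weight matrices $W_{\mathcal{F}}$, $W_{\mathcal{I}}$, $W_{\mathcal{O}}$, and the symbols $M_t$ (memory bank at step $t$), $E_t$ (input embeddings), $E_{\text{mem}}$ (cross-attention readout from memory), and $E_{\text{attn}}$ (ordinary self-attention output) are all introduced here for illustration, with shapes assumed compatible.

$$
\begin{aligned}
\mathcal{F} &= \sigma\!\left(E_{\text{mem}} W_{\mathcal{F}}\right), \qquad
\mathcal{I} = \sigma\!\left(E_t W_{\mathcal{I}}\right), \qquad
\mathcal{O} = \sigma\!\left(E_{\text{mem}} W_{\mathcal{O}}\right) \\
M_{t+1} &= \mathcal{F} \odot M_t + \mathcal{I} \odot \tanh\!\left(E_{\text{mem}}\right) \\
E_{\text{out}} &= E_{\text{attn}} + \mathcal{O} \odot E_{\text{mem}}
\end{aligned}
$$

Under these assumptions, the forget gate $\mathcal{F}$ decides how much of the old memory is kept, the input gate $\mathcal{I}$ decides how much new content is written, and the output gate $\mathcal{O}$ scales the memory contribution that is added to the normal attention flow (the pink curve alongside the gray curve in Figure 1).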