Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell

TL;DR

This work tackles the challenge of evaluating and improving long-term memory in LLMs. It introduces BEAM, a scalable benchmark with up to 10M-token dialogues and multi-faceted memory probes, and LIGHT, a cognitive-inspired memory framework integrating episodic recall, working memory, and an external scratchpad. Across rigorous automated and human evaluations, LIGHT yields consistent memory gains over baselines, especially in summarization and instruction-following tasks, with ablations demonstrating each component’s value. By releasing code and data, the authors enable robust, broad benchmarking of long-context reasoning and model memory capabilities for real-world applications.

Abstract

Evaluating the abilities of large language models (LLMs) on tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by existing benchmarks, which often lack narrative coherence, cover narrow domains, and test only simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M-token context windows (with and without retrieval augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5% to 12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
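To make the three memory systems named in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the class name `ToyMemory`, its methods, the use of word overlap as a stand-in for a real retriever, treating working memory as the most recent turns, and having the caller supply salient facts for the scratchpad.

```python
from collections import deque
from typing import Optional


def _tokens(text):
    # Crude tokenizer: lowercase whitespace split (illustrative only).
    return set(text.lower().split())


class ToyMemory:
    """Hypothetical container for three memory stores (illustrative only)."""

    def __init__(self, working_size=2):
        self.episodic = []                           # every past turn, searchable
        self.working = deque(maxlen=working_size)    # assumed: most recent turns
        self.scratchpad = []                         # accumulated salient facts

    def add_turn(self, turn: str, salient_fact: Optional[str] = None) -> None:
        # Each turn goes to both episodic and working memory; the deque
        # silently drops the oldest turn once it is full.
        self.episodic.append(turn)
        self.working.append(turn)
        if salient_fact is not None:
            self.scratchpad.append(salient_fact)

    def retrieve(self, query: str, k: int = 1):
        # Stand-in retriever: rank turns by word overlap with the query.
        q = _tokens(query)
        ranked = sorted(self.episodic,
                        key=lambda t: len(q & _tokens(t)),
                        reverse=True)
        return ranked[:k]

    def build_context(self, query: str, k: int = 1):
        # Combine the three stores into one structure a prompt could use.
        return {
            "episodic": self.retrieve(query, k),
            "working": list(self.working),
            "scratchpad": list(self.scratchpad),
        }


memory = ToyMemory(working_size=2)
memory.add_turn("I adopted a cat named Miso", salient_fact="pet: cat named Miso")
memory.add_turn("The weather is nice today")
memory.add_turn("I need to book a flight to Oslo")
context = memory.build_context("what is my cat named", k=1)
```

In this sketch the first turn has already fallen out of the two-turn working memory by the time the question arrives, yet it is still reachable through episodic retrieval and the scratchpad, which is the kind of complementarity the abstract describes.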

Paper Structure

This paper contains 50 sections, 5 figures, 9 tables, and 3 algorithms.

Figures (5)

  • Figure 1: Overview of data generation.
  • Figure 2: Overview of the LIGHT framework.
  • Figure 3: Ablation study of the effect of different components in LIGHT.
  • Figure 4: Effect of varying retrieval budget (K) on the performance.
  • Figure 5: Performance comparison between dense retrieval and sparse retrieval (SPLADE) in LIGHT.