The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

Xiaoyuan Liu, Tian Liang, Dongyang Ma, Deyu Zhou, Haitao Mi, Pinjia He, Yan Wang

TL;DR

This work tackles the problem of stateless language models by introducing StateLM, a class of foundation models that learn to manage their own context via a Pensieve-inspired external memory and a toolkit of memory operations. StateLM runs a learned, stateful reasoning loop that reads, indexes, takes notes, and deliberately deletes intermediate content to maintain a compact, noise-free working state, enabling effective long-horizon reasoning. Trained on supervised trajectories and then reinforced with a GRPO-style objective, StateLM achieves substantial gains on long-document QA, chat memory, and deep-research tasks, outperforming strong baselines and existing agentic methods: it reaches up to 52% accuracy on BrowseComp-Plus, where standard LLM counterparts struggle around 5%, and improves absolute accuracy on chat memory by 10% to 20%. The approach shifts LLMs from passive predictors to autonomous, state-aware agents and offers a scalable framework for learned context management across diverse domains, reducing reliance on externally engineered workflows. Overall, StateLM is a principled step toward internalized memory management in foundation models, with practical implications for AI assistants and retrieval-augmented systems.

Abstract

In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineer its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes demonstrate StateLM's effectiveness across diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs across all model scales; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the performance gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts struggle around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents where reasoning becomes a stateful and manageable process.

Paper Structure (33 sections, 5 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: StateLM (right) maintains a "sawtooth" context-use profile, rather than monotonic accumulation (left).
  • Figure 2: The self-context engineering workflow of StateLM. Given a query over a long context, StateLM engages in a multi-round, stateful reasoning loop that analyzes the input, builds an index, and iteratively searches, reads, takes notes, and prunes its working context. Messages highlighted in red are replaced with stubs after the deletion operation. The loop terminates once StateLM determines it has gathered sufficient information for the final answer.
  • Figure 3: The two training stages of StateLM.
  • Figure 4: NovelQA accuracy by answer evidence token position in the provided context.
  • Figure 5: Performance breakdown by problem aspect on NovelQA (left) and LongMemEval (middle), and by question complexity on NovelQA (right).
  • ...and 4 more figures
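
The read, note, and prune cycle described in the captions for Figures 1 and 2 can be illustrated with a toy sketch. This is not the authors' implementation: the class and function names (`WorkingContext`, `run_loop`, `note_fn`), the stub text, the word-count proxy for tokens, and the choice to keep notes in a separate list are all assumptions made for illustration. In StateLM the model itself decides when to read, note, and delete; here a fixed policy stands in for that decision.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical placeholder; the stub format used by the paper is not given here.
STUB = "[deleted]"


@dataclass
class Message:
    role: str
    text: str

    def size(self) -> int:
        # Crude token proxy: count whitespace-separated words.
        return len(self.text.split())


@dataclass
class WorkingContext:
    messages: List[Message] = field(default_factory=list)
    # Assumption: notes live in a separate store rather than in the context.
    notes: List[str] = field(default_factory=list)

    def append(self, role: str, text: str) -> int:
        self.messages.append(Message(role, text))
        return len(self.messages) - 1

    def delete(self, index: int) -> None:
        # Deletion keeps the message slot but replaces its content with a stub.
        self.messages[index].text = STUB

    def size(self) -> int:
        return sum(m.size() for m in self.messages)


def run_loop(
    query: str,
    chunks: List[str],
    note_fn: Callable[[str], Optional[str]],
) -> Tuple[WorkingContext, List[int]]:
    """Read each chunk, optionally take a note, then prune the chunk."""
    ctx = WorkingContext()
    ctx.append("user", query)
    profile = [ctx.size()]  # context size after each step
    for chunk in chunks:
        idx = ctx.append("tool", chunk)  # read: context grows
        profile.append(ctx.size())
        note = note_fn(chunk)  # note-taking: keep only what matters
        if note is not None:
            ctx.notes.append(note)
        ctx.delete(idx)  # prune: context shrinks back
        profile.append(ctx.size())
    return ctx, profile


ctx, profile = run_loop(
    "what is the key",
    ["alpha beta gamma delta", "the key is 42 here now"],
    lambda c: "key=42" if "42" in c else None,
)
print(profile)    # [4, 8, 5, 11, 6]
print(ctx.notes)  # ['key=42']
```

The printed profile rises on each read and falls after each deletion, which is the "sawtooth" shape that Figure 1 contrasts with monotonic accumulation.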