The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

Xiaoyuan Liu; Tian Liang; Dongyang Ma; Deyu Zhou; Haitao Mi; Pinjia He; Yan Wang

TL;DR

This work tackles the statelessness of language models, which passively accept an externally engineered context, by introducing StateLM, a class of foundation models that learn to manage their own context through a Pensieve-inspired external memory and a toolkit of memory operations. StateLM runs a learned, stateful reasoning loop that reads, indexes, takes notes, and deliberately deletes intermediate content to keep its working state compact and low in noise, which enables effective long-horizon reasoning. Trained on supervised trajectories and then reinforced with a GRPO-style objective, StateLM achieves substantial gains on long-document QA, chat memory, and deep-research tasks, outperforming strong baselines and existing agentic methods: it reaches up to 52% accuracy on BrowseComp-Plus, where standard LLM counterparts stay around 5%, and improves absolute accuracy on chat memory by 10% to 20%. The approach shifts LLMs from passive predictors to state-aware agents and offers a framework for learned context management across diverse domains, reducing reliance on externally engineered workflows. Overall, StateLM is a step toward internalized memory management in foundation models, with practical implications for AI assistants and retrieval-augmented systems.

Abstract

In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineer its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes demonstrate StateLM's effectiveness across diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs across all model scales; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the performance gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts struggle around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents where reasoning becomes a stateful and manageable process.
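To make the memory tools named in the abstract more concrete, the following is a minimal sketch of how note-taking and context pruning could interact. It is an illustration under stated assumptions, not the authors' implementation: the `Message` and `WorkingContext` classes, the stub text, and the word-count size proxy are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    msg_id: int
    content: str
    deleted: bool = False


@dataclass
class WorkingContext:
    """Hypothetical working context: a message list plus an external note store."""
    messages: List[Message] = field(default_factory=list)
    notes: Dict[str, str] = field(default_factory=dict)

    def append(self, content: str) -> int:
        # Add a message to the context and return its id.
        msg = Message(msg_id=len(self.messages), content=content)
        self.messages.append(msg)
        return msg.msg_id

    def take_note(self, key: str, text: str) -> None:
        # Notes live outside the message list, so they survive pruning.
        self.notes[key] = text

    def delete(self, msg_id: int) -> None:
        # Assumption: deletion keeps the slot but swaps the body for a short stub.
        msg = self.messages[msg_id]
        msg.content = f"[message {msg_id} deleted]"
        msg.deleted = True

    def size(self) -> int:
        # Crude proxy for token count: whitespace-separated words.
        return sum(len(m.content.split()) for m in self.messages)


ctx = WorkingContext()
chunk_id = ctx.append("long passage " * 200)   # a 400-word chunk is read
ctx.take_note("fact_1", "The key is hidden in chapter 3.")
size_before = ctx.size()
ctx.delete(chunk_id)                            # the chunk is pruned to a stub
size_after = ctx.size()
print(size_before, size_after, ctx.notes["fact_1"])
```

In this toy setup the extracted fact stays available through `notes` after the bulky passage has been pruned, which is the behavior the abstract attributes to a model that manages its own context.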

Paper Structure (33 sections, 5 equations, 9 figures, 10 tables)

Introduction
Related Work: Human as the Wizard
RAG
Agentic Memory
This Study: The Model as its Own Context Engineer.
Methodology
Problem Setup
Our Method: StateLM Reasoning with Pensieve
Training Approach
Supervised Learning from Expert Trajectories
Outcome-based Reject Sampling.
Process-based Reject Sampling.
Training Sample Construction.
Action Balancing.
Reinforcement Learning for Self-Improvement
...and 18 more sections

Figures (9)

Figure 1: StateLM (right) maintains a "sawtooth" context-use profile, rather than monotonic accumulation (left).
Figure 2: The self-context engineering workflow of StateLM. Given a query over a long context, StateLM engages in a multi-round, stateful reasoning loop that analyzes the input, builds an index, and iteratively searches, reads, takes notes, and prunes its working context. Messages highlighted in red are replaced with stubs after the deletion operation. The loop terminates once StateLM determines it has gathered sufficient information for the final answer.
Figure 3: The two training stages of StateLM.
Figure 4: NovelQA accuracy by answer evidence token position in the provided context.
Figure 5: Performance breakdown by problem aspect on NovelQA (left) and LongMemEval (middle), and by question complexity on NovelQA (right).
...and 4 more figures
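The contrast described in the Figure 1 caption, combined with the read, note, and prune steps from the Figure 2 caption, can be illustrated with a small simulation. This is only a sketch: the function name `context_profile`, the chunk, note, and stub sizes, and the assumption that each note stays in the context are made up for illustration and are not taken from the paper.

```python
from typing import List


def context_profile(chunk_sizes: List[int], note_size: int = 20,
                    stub_size: int = 3, prune: bool = True) -> List[int]:
    """Return the context size after each step of a read/note/(prune) loop."""
    size = 0
    profile = []
    for chunk in chunk_sizes:
        size += chunk                  # read: the chunk enters the context
        profile.append(size)
        size += note_size              # note: a short summary is written
        profile.append(size)
        if prune:
            size -= chunk - stub_size  # delete: the chunk shrinks to a stub
            profile.append(size)
    return profile


chunks = [1000, 1000, 1000]
accumulating = context_profile(chunks, prune=False)
sawtooth = context_profile(chunks, prune=True)

print("accumulating:", accumulating)
print("sawtooth:    ", sawtooth)
print("peak sizes:  ", max(accumulating), max(sawtooth))
```

Without pruning, the size only grows with every chunk read. With pruning, each read produces a spike that falls back to roughly the size of the notes and stubs kept so far, so the peak stays close to a single chunk regardless of how many chunks are processed.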
