The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context
Xiaoyuan Liu, Tian Liang, Dongyang Ma, Deyu Zhou, Haitao Mi, Pinjia He, Yan Wang
TL;DR
This work tackles the problem of stateless, context-insensitive language models by introducing StateLM, a class of foundation models that learn to manage their own context via a Pensieve-inspired external memory and a toolkit of memory operations. StateLM employs a learned, stateful reasoning loop that reads, indexes, takes notes, and deliberately deletes intermediate content to maintain a compact, noise-free working state, enabling effective long-horizon reasoning. Trained on supervised trajectories and then reinforced with a GRPO-style objective, StateLM achieves substantial gains on long-document QA, chat memory, and deep-research tasks, outperforming strong baselines and existing agentic methods: it reaches up to 52% accuracy on BrowseComp-Plus, where standard LLM counterparts stay around 5%, and improves chat-memory accuracy by 10 to 20 absolute points over standard LLMs. The approach shifts LLMs from passive predictors to autonomous, state-aware agents and demonstrates a scalable framework for learned context management across diverse domains, reducing reliance on externally engineered workflows. Overall, StateLM represents a principled step toward scalable, internalized memory management in foundation models, with broad practical implications for AI assistants and retrieval-augmented systems.
Abstract
In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineer its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes demonstrate StateLM's effectiveness in diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs at all model scales; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the performance gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts struggle around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents for which reasoning becomes a stateful and manageable process.
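To make the three memory tools named above (document indexing, note-taking, context pruning) concrete, the following is a minimal toy sketch, not the authors' implementation. Every name in it (WorkingState, index_document, read, take_note, delete) is a hypothetical choice made for illustration, the index is a plain keyword lookup, and the tool calls are scripted by hand, whereas StateLM is trained to decide for itself when to issue them.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingState:
    """Toy stand-in for a model-managed context (all names are hypothetical)."""
    chunks: list = field(default_factory=list)   # external store: document chunks
    index: dict = field(default_factory=dict)    # keyword -> list of chunk ids
    context: dict = field(default_factory=dict)  # entry id -> text currently in context
    notes: list = field(default_factory=list)    # compact notes that outlive pruning
    _next_id: int = 0

    def index_document(self, text: str, chunk_words: int = 8) -> None:
        """Split a document into fixed-size chunks and build a keyword index."""
        words = text.split()
        for start in range(0, len(words), chunk_words):
            chunk = " ".join(words[start:start + chunk_words])
            chunk_id = len(self.chunks)
            self.chunks.append(chunk)
            # Normalise keywords once per chunk so each chunk id is listed once.
            for key in {w.strip(".,").lower() for w in chunk.split()}:
                self.index.setdefault(key, []).append(chunk_id)

    def read(self, keyword: str) -> list:
        """Load chunks matching a keyword into the context; return entry ids."""
        entry_ids = []
        for chunk_id in self.index.get(keyword.lower(), []):
            entry_id = self._next_id
            self._next_id += 1
            self.context[entry_id] = self.chunks[chunk_id]
            entry_ids.append(entry_id)
        return entry_ids

    def take_note(self, note: str) -> None:
        """Keep a short distilled fact instead of the raw text."""
        self.notes.append(note)

    def delete(self, entry_ids: list) -> None:
        """Prune raw entries from the context once they have been distilled."""
        for entry_id in entry_ids:
            self.context.pop(entry_id, None)

    def context_words(self) -> int:
        """Rough size of the working state, counted in whitespace-separated words."""
        texts = list(self.context.values()) + self.notes
        return sum(len(t.split()) for t in texts)


if __name__ == "__main__":
    state = WorkingState()
    state.index_document(
        "The meeting moved to Lisbon. Budget talks are set for Friday in the main hall."
    )
    loaded = state.read("Friday")            # pull the relevant chunk into context
    state.take_note("Budget talks: Friday")  # distil what matters
    state.delete(loaded)                     # prune the raw chunk
    print(state.context, state.notes, state.context_words())
```

In this sketch the raw chunk leaves the context after a note has been taken, so the working state shrinks while the distilled fact remains available, which is the read, note, delete pattern the abstract describes.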
