Learning to Remember: End-to-End Training of Memory Agents for Long-Context Reasoning

Kehao Zhang; Shangtong Gui; Sheng Yang; Wei Chen; Yang Feng

TL;DR

This work addresses the limitations of long-context LLMs and retrieval-based systems on dynamic, continually updated streams, where neither pure retrieval nor internal memory alone is sufficient for consistent reasoning over time. It introduces the Unified Memory Agent (UMA), an end-to-end reinforcement learning framework that jointly optimizes memory operations (CRUD) and question answering, using a dual-memory design with a compact core memory $m^{core}$ and a structured Memory Bank $\mathcal{B}$. It also proposes Ledger-QA, a benchmark for long-horizon state tracking in which answers are latent aggregates derived from accumulated updates, so that solving it requires persistent state maintenance. Trained with Task-Stratified GRPO and nested trajectory sampling, UMA outperforms baselines on Test-Time Learning (TTL) and dynamic reasoning benchmarks while remaining competitive on Accurate Retrieval (AR), and ablations support the need for end-to-end memory management in long-context reasoning.

Abstract

Long-context LLMs and Retrieval-Augmented Generation (RAG) systems process information passively, deferring state tracking, contradiction resolution, and evidence aggregation to query time, which becomes brittle under ultra-long streams with frequent updates. We propose the Unified Memory Agent (UMA), an end-to-end reinforcement learning framework that unifies memory operations and question answering within a single policy. UMA maintains a dual memory representation: a compact core summary for global context and a structured Memory Bank that supports explicit CRUD (create, update, delete, reorganize) over key-value entries, enabling proactive consolidation during streaming. To evaluate long-horizon memory behavior, we introduce Ledger-QA, a diagnostic benchmark for continuous state tracking where answers are latent values derived from accumulated updates rather than local span retrieval. Across 13 datasets spanning Ledger-QA, Test-Time Learning, and Accurate Retrieval, UMA substantially outperforms long-context and RAG baselines on dynamic reasoning and learning tasks while remaining competitive on standard retrieval benchmarks, underscoring the importance of learned, end-to-end memory management.
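
The dual memory described above can be pictured with a small toy sketch. It is only an illustration under stated assumptions, not UMA's implementation: the class name `DualMemory`, the method signatures, and the numeric expense values are hypothetical, and in UMA the operations are chosen by a learned policy rather than called by hand. The sketch shows how an answer can be a latent aggregate read from maintained state instead of a span retrieved from the raw stream.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DualMemory:
    """Toy dual memory (hypothetical): a core summary plus a key-value bank."""

    core: str = ""
    bank: dict[str, float] = field(default_factory=dict)

    def create(self, key: str, value: float) -> None:
        # Add a new entry; refuse to silently overwrite an existing one.
        if key in self.bank:
            raise KeyError(f"entry already exists: {key}")
        self.bank[key] = value

    def update(self, key: str, delta: float) -> None:
        # Fold an incremental change into the stored running value.
        self.bank[key] = self.bank[key] + delta

    def delete(self, key: str) -> None:
        # Drop an entry that is no longer relevant.
        del self.bank[key]

    def reorganize(self, old_keys: list[str], new_key: str) -> None:
        # Merge several entries into one consolidated entry.
        total = sum(self.bank.pop(k) for k in old_keys)
        self.bank[new_key] = self.bank.get(new_key, 0.0) + total


# Streamed updates are consolidated as they arrive (made-up values).
memory = DualMemory(core="User is tracking monthly expenses.")
memory.create("food", 12.5)
memory.update("food", 7.5)
memory.create("coffee", 4.0)
memory.create("taxi", 30.0)
memory.reorganize(["food", "coffee"], "food_and_drink")
memory.delete("taxi")

# The query is answered by reading a maintained field, not by re-scanning logs.
answer = memory.bank["food_and_drink"]
```

Here the running total never appears verbatim in any single update, which is the kind of question Ledger-QA is described as targeting.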

Paper Structure (80 sections, 9 equations, 4 figures, 8 tables)

Introduction
Method
Problem Formulation as MDP
State Space ($\mathcal{S}$).
Action Space ($\mathcal{A}$).
Transition Dynamics ($\mathcal{P}$).
Unified Memory Agent Architecture
Input Representation.
Phase I: Sequential Memory Maintenance.
Phase II: Hybrid Retrieval-Augmented QA.
Training: Task-Stratified GRPO
Nested Trajectory Sampling
Reward Function
Monte Carlo Advantage Estimation
Advantage for Memory Steps ($A_{mem}$).
...and 65 more sections

Figures (4)

Figure 1: Expense-tracking example: RAG reprocesses retrieved logs per query, while Agentic Memory maintains a structured state and answers by reading the relevant fields.
Figure 2: Overview of UMA. Phase I incrementally maintains a structured Memory Bank and core summary via CRUD over chunks; Phase II answers queries using both structured retrieval from the bank and raw-context retrieval.
Figure 3: Illustration of Task-Stratified GRPO. For a given input, multiple trajectories are sampled containing interleaved Memory (blue) and QA (red) steps. (Right) The reward function combines immediate tool execution feedback ($r_{tool}$) with outcome assessments ($r_{outcome}$). Crucially, memory steps receive a Future Utility Signal derived from subsequent QA rewards. (Bottom) Advantages are normalized within distinct groups: all memory steps are aggregated into a global pool ($\mathcal{G}_{mem}$), while QA steps are normalized strictly within their specific query groups ($\mathcal{G}_{qa,j}$).
Figure 4: Performance comparison on Ledger-QA across varying session counts. The x-axis represents the number of dialogue sessions (simulating increasing time horizons), and the y-axis denotes accuracy. Detailed numerical results are provided in Table \ref{tab:ledger_detailed} in Appendix \ref{app:ledger_results}.
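
The group-wise normalization described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming the usual GRPO-style z-score (reward minus group mean, divided by group standard deviation); the caption does not give the exact formula, so the normalization form, the function names, the `eps` constant, and the reward values are all assumptions. Memory-step rewards are pooled into one group ($\mathcal{G}_{mem}$), while QA rewards are normalized separately for each query ($\mathcal{G}_{qa,j}$).

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-8):
    """Z-score rewards within a single group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def stratified_advantages(mem_rewards, qa_rewards_by_query):
    """Normalize memory steps in one global pool and QA steps per query."""
    a_mem = normalize(mem_rewards)
    a_qa = {q: normalize(rs) for q, rs in qa_rewards_by_query.items()}
    return a_mem, a_qa


# Made-up rewards: four memory steps, and two sampled answers per query.
mem_rewards = [0.2, 0.8, 0.5, 0.5]
qa_rewards_by_query = {"q1": [1.0, 0.0], "q2": [0.0, 0.0]}
a_mem, a_qa = stratified_advantages(mem_rewards, qa_rewards_by_query)
```

Keeping the QA groups separate means an easy query and a hard query do not distort each other's baselines; a query whose sampled answers all receive the same reward contributes zero advantage.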
