Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang

TL;DR

This work tackles long-context question answering where critical evidence is dispersed across vast corpora. It introduces ReMemR1, a memory-augmented agent that uses a history-augmented state with callback queries to retrieve from the full memory history, enabling non-linear reasoning and revisiting early evidence. To train such a system, the authors propose RLMLR, a reinforcement learning framework that combines trajectory-level final-answer rewards with dense, step-level rewards to guide memory usage and retrieval. Empirical results on HotpotQA and 2WikiMultiHopQA demonstrate significant gains over baselines, including strong generalization to out-of-distribution data and robustness under distant-evidence settings, with ablations validating the effectiveness of both RLMLR and the RL-driven memory callback.

Abstract

Large language models face challenges in long-context question answering, where the key evidence for a query may be dispersed across millions of tokens. Existing works equip large language models with a memory corpus that is dynamically updated during a single-pass document scan, an approach known as "memorize while reading". While this approach scales efficiently, it suffers from irreversible forward-only processing, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, a memory-augmented agent with callback-enhanced memory that allows selective retrieval from the entire memory history, enabling non-linear reasoning and the revisiting of early evidence. To further strengthen training, we propose Reinforcement Learning with Multi-Level Rewards (RLMLR), which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support multi-hop memory utilization. Experiments on long-document QA show significant gains over existing memory-based approaches, validating ReMemR1 as an effective solution for long-context reasoning agents.
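
To make the callback idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy word-overlap retriever and a hypothetical `policy` callable standing in for the LLM, which reads one chunk at a time and returns an updated memory together with a callback query. The names `overlap_score`, `retrieve`, `run_agent`, and `toy_policy` are illustrative only.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, memory: str) -> int:
    """Toy relevance score (an assumption): number of shared lowercase words."""
    return len(set(query.lower().split()) & set(memory.lower().split()))


def retrieve(query: str, history: List[str]) -> str:
    """Return the stored memory that best matches the callback query, or ''."""
    if not query or not history:
        return ""
    best = max(history, key=lambda m: overlap_score(query, m))
    return best if overlap_score(query, best) > 0 else ""


# Hypothetical LLM stand-in: (question, chunk, memory, recalled) -> (memory, query)
Policy = Callable[[str, str, str, str], Tuple[str, str]]


def run_agent(question: str, chunks: List[str], policy: Policy) -> Tuple[str, List[str]]:
    """Single-pass scan where each step may look back over all earlier memories."""
    memory, query = "", ""
    history: List[str] = []
    for chunk in chunks:
        # The callback query from the previous step selects from the full history,
        # so evidence overwritten in `memory` can still be recovered.
        recalled = retrieve(query, history)
        memory, query = policy(question, chunk, memory, recalled)
        history.append(memory)
    return memory, history


def toy_policy(question: str, chunk: str, memory: str, recalled: str) -> Tuple[str, str]:
    """Illustrative policy: overwrite memory with the chunk, keeping any recalled fact."""
    new_memory = f"{recalled} | {chunk}" if recalled else chunk
    return new_memory, question  # reuse the question as the callback query


final_memory, memory_history = run_agent(
    "where does alice live",
    ["alice lives in paris", "weather is nice", "paris is in france"],
    toy_policy,
)
```

In this sketch the current memory alone would lose the first chunk after it is overwritten, while the history lookup brings it back. A trained agent would generate the memory and the query itself rather than rely on a fixed rule.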

Paper Structure

This paper contains 34 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of approaches for question answering in a long-context setting. (a) Full context input introduces substantial complexity and challenges the LLM to locate the correct information. (b) The "memorize while reading" paradigm processes documents by chunks to reduce context length at each step. Still, the irreversible and linear memory overwriting prevents the model from connecting distantly related information. (c) Our method augments the agent's state to allow for callback of historical memories, enabling it to integrate relevant facts from earlier steps into its reasoning process.
  • Figure 2: The comparison of state transition functions between previous work and our method. (left) Conventional memory agents use a restrictive state $s_t=m_t$, where the next memory $m_{t+1}$ only depends on the current context $c_t$ and memory $m_t$. (right) Our method introduces a history-augmented state $s_t=(m_t, q_t)$, where the agent generates a callback query $q_t$ to retrieve relevant information from its entire memory history $\{m_i\}_{i \leqslant t}$, enabling non-linear reasoning paths.
  • Figure 3: Overview of RL with Multi-Level Rewards (RLMLR). (a) From the trajectories generated by the actor model, we compute outcome rewards at terminal states and state rewards at all states. (b) Each reward type is normalized at the corresponding level: state rewards across the states at the same step, and outcome rewards across all trajectories in the group.
  • Figure 4: Accuracy on 2Wiki with distant evidence.
  • Figure 5: Training dynamics of our method. ReMemR1 enables the LLM to generate both inner memory and callback queries, introducing additional formatting requirements. These constraints initially lead to a lower success rate due to frequent parsing errors, but performance rapidly improves after around 20 steps as the model learns to follow the required format.
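
The two normalization levels described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: it assumes all trajectories in a group have the same number of steps, uses mean and standard deviation normalization, and combines the two levels with a hypothetical mixing weight `alpha`. The names `normalize` and `multi_level_advantages` are also hypothetical.

```python
from statistics import mean, pstdev
from typing import List

EPS = 1e-6  # guards against division by zero when all rewards are equal


def normalize(values: List[float]) -> List[float]:
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def multi_level_advantages(
    outcome_rewards: List[float],
    state_rewards: List[List[float]],
    alpha: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> List[List[float]]:
    """Return advantages[i][t] for trajectory i at step t.

    outcome_rewards[i]  : final-answer reward of trajectory i
    state_rewards[i][t] : step-level reward of trajectory i at step t
    """
    num_trajectories = len(outcome_rewards)
    num_steps = len(state_rewards[0])

    # Outcome level: normalize across all trajectories in the group.
    outcome_adv = normalize(outcome_rewards)

    # State level: normalize across the states that share the same step index.
    step_adv = [
        normalize([state_rewards[i][t] for i in range(num_trajectories)])
        for t in range(num_steps)
    ]

    return [
        [alpha * outcome_adv[i] + (1 - alpha) * step_adv[t][i] for t in range(num_steps)]
        for i in range(num_trajectories)
    ]


advantages = multi_level_advantages(
    outcome_rewards=[1.0, 0.0],
    state_rewards=[[0.2, 0.8], [0.4, 0.8]],
)
```

Normalizing each level against its own peers keeps the dense step-level signal from being swamped by the scale of the final-answer reward, and vice versa. At the second step in the example, both trajectories receive the same state reward, so only the outcome level separates them.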