When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang, Xiang Wang, An Zhang, Ke Shen, Tat-Seng Chua

TL;DR

The paper tackles the challenge of long-context reasoning in large language models, where context length degrades performance and naive recurrent memory can cause memory explosion. It introduces GRU-Mem, a gated recurrent memory framework that adds an update gate and an exit gate to control when memory is updated and when the reasoning loop ends, trained via end-to-end RL with rewards $r^{\text{update}}$ and $r^{\text{exit}}$ alongside standard outcome/format rewards. Empirical results show GRU-Mem outperforms the vanilla MemAgent across diverse long-context QA benchmarks and achieves up to 400% inference speedups, especially on out-of-distribution NIAH tasks, by stabilizing memory and enabling early termination. The gating mechanism thus offers a practical path to robust, efficient long-context reasoning, though the approach is currently demonstrated within QA and may require adjustments for broader tasks due to RL-training stability concerns.

Abstract

While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs), whose performance degrades as the context length grows. The recent work MemAgent tackles this by processing the context chunk-by-chunk in an RNN-like loop and updating a textual memory that is used for the final answer. However, this naive recurrent memory update has two crucial drawbacks: (i) the memory can quickly explode because it is updated indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation even after sufficient evidence has been collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory is updated only when the update gate is open, and the recurrent loop exits immediately once the exit gate is open. To endow the model with these capabilities, we introduce two reward signals, $r^{\text{update}}$ and $r^{\text{exit}}$, within end-to-end RL, rewarding correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400% inference speed acceleration.
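The gated loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `StepOutput`, `gated_recurrent_read`, and `toy_agent` are hypothetical, the agent is a keyword-matching stand-in for the LLM, and the ordering (apply the update gate first, then check the exit gate) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class StepOutput:
    """Hypothetical container for what the agent emits at one step."""
    candidate_memory: str  # proposed new memory for this step
    update: bool           # update gate: True means overwrite the memory
    exit: bool             # exit gate: True means stop reading chunks


# Assumed agent signature: (question, memory, chunk) -> StepOutput
Agent = Callable[[str, str, str], StepOutput]


def gated_recurrent_read(
    agent: Agent,
    question: str,
    chunks: Iterable[str],
    initial_memory: str = "",
) -> Tuple[str, int]:
    """Read chunks in order; return the final memory and the steps taken."""
    memory = initial_memory
    steps = 0
    for chunk in chunks:
        steps += 1
        out = agent(question, memory, chunk)
        if out.update:
            # Gate open: replace the memory with the candidate.
            memory = out.candidate_memory
        # Gate closed: the previous memory is kept unchanged.
        if out.exit:
            # Enough evidence collected: skip the remaining chunks.
            break
    return memory, steps


def toy_agent(question: str, memory: str, chunk: str) -> StepOutput:
    """Stand-in for the LLM: treats any chunk containing 'needle' as evidence."""
    has_evidence = "needle" in chunk
    candidate = (memory + " " + chunk).strip() if has_evidence else memory
    return StepOutput(candidate, update=has_evidence, exit=has_evidence)


chunks = ["filler a", "the needle is 42", "filler b", "filler c"]
final_memory, steps_taken = gated_recurrent_read(toy_agent, "What is the needle?", chunks)
```

In this toy run the loop stops at the second chunk, so the two trailing filler chunks are never processed. That skipped work is the kind of saving the exit gate is meant to provide, while the closed update gate on the first chunk keeps the memory from growing on evidence-free text.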

Paper Structure (28 sections, 13 equations, 35 figures, 4 tables, 1 algorithm)

Figures (35)

  • Figure 1: MemAgent and its limitations. MemAgent reads a long context chunk-by-chunk in an RNN-like manner, recurrently updating a textual memory and answering from the final memory. It faces two crucial risks: memory explosion from over-accumulating irrelevant memories, and the lack of an exit mechanism once sufficient evidence has been collected.
  • Figure 2: The memory updating process with the gated recurrent memory (GRU-Mem). At each time step $t$, the memory agent $\phi_\theta$ decides: (1) whether to update the memory $\mathcal{M}_{t}$ with the candidate memory $\hat{\mathcal{M}}_{t}$ or to keep the previous memory $\mathcal{M}_{t-1}$ unchanged, based on the update gate status $\mathcal{U}_{t}$ ($\texttt{True}$ for updating and $\texttt{False}$ for retaining $\mathcal{M}_{t-1}$); and (2) whether to stop scanning further chunks, based on the exit gate status $\mathcal{E}_{t}$ ($\texttt{True}$ for exiting and $\texttt{False}$ for continuing processing).
  • Figure 2: Performance when evidence occurs at top 20% positions.
  • Figure 3: Prompt of GRU-Mem (partial).
  • Figure 4: The advantage calculation process. The trajectory-level advantage $\hat{A}_{g,t}^{\text{traj}}$ and the turn-level advantage $\hat{A}_{g,t}^{\text{turn}}$ are calculated separately. They are combined into the total advantage with $\alpha$ (i.e., $\hat{A}_{g,t,i} = \alpha \hat{A}_{g,t,i}^{\text{traj}} + (1-\alpha) \hat{A}_{g,t,i}^{\text{turn}}$).
  • ...and 30 more figures
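
The advantage combination in the Figure 4 caption, $\hat{A}_{g,t,i} = \alpha \hat{A}_{g,t,i}^{\text{traj}} + (1-\alpha) \hat{A}_{g,t,i}^{\text{turn}}$, is a per-element interpolation between the two advantage estimates. The sketch below only illustrates that arithmetic. The function name `combine_advantages` is hypothetical, the restriction of $\alpha$ to $[0, 1]$ is an assumption not stated in the caption, and how the trajectory-level and turn-level advantages are themselves computed is not covered here.

```python
from typing import List, Sequence


def combine_advantages(
    traj_adv: Sequence[float],
    turn_adv: Sequence[float],
    alpha: float,
) -> List[float]:
    """Blend trajectory-level and turn-level advantages element by element."""
    # Assumption: alpha is a mixing weight in [0, 1].
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    if len(traj_adv) != len(turn_adv):
        raise ValueError("advantage sequences must have the same length")
    return [
        alpha * a_traj + (1.0 - alpha) * a_turn
        for a_traj, a_turn in zip(traj_adv, turn_adv)
    ]


# Equal weighting of the two signals.
blended = combine_advantages([1.0, -1.0], [0.0, 2.0], alpha=0.5)
```

With $\alpha = 1$ the result reduces to the trajectory-level advantage alone, and with $\alpha = 0$ to the turn-level advantage alone.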