Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning

Zhuoen Chen, Dongfang Li, Meishan Zhang, Baotian Hu, Min Zhang

TL;DR

LycheeMemory tackles the challenge of long-context processing by compressing long inputs into a memory bank of KV-cache tokens and performing selective, state-aware recall with a Gate and a Reasoner. The compressor and reasoner are trained end-to-end via reinforcement learning, while the Gate is trained separately as a classifier, enabling effective multi-hop reasoning over ultra-long contexts. It achieves competitive accuracy on multi-hop QA benchmarks, scales context length from 7K to 1.75M tokens, and delivers substantial efficiency gains including up to 2x memory reduction and 6x faster inference compared to MemAgent. This approach offers a scalable, interpretable solution for ultra-long context processing with practical impact on long-document QA and summarization tasks.

Abstract

Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.
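The compress, gate, and reason loop described in the abstract can be sketched as the toy program below. It only illustrates the control flow and is not the authors' implementation: every name (`MemoryBlock`, `answer_over_memory`, `toy_compress`, `toy_gate`, `toy_reason`) is hypothetical, plain strings stand in for compressed KV-cache blocks, a keyword-overlap test stands in for the learned gate, and the single in-order pass over the memory bank is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MemoryBlock:
    index: int
    summary: str  # toy stand-in for a compressed KV-cache block (assumption)


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Segment the long input into fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def keywords(text: str) -> set:
    """Lowercased words longer than three characters, punctuation stripped."""
    words = (w.strip("?.,|").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def toy_compress(chunk: List[str]) -> str:
    # Hypothetical compressor: a real one would emit far fewer memory tokens.
    return " ".join(chunk)


def toy_gate(question: str, working_memory: str, block: MemoryBlock) -> bool:
    # State-aware recall: the decision depends on the question AND on what
    # the working memory already contains.
    state = keywords(question) | keywords(working_memory)
    return bool(state & keywords(block.summary))


def toy_reason(question: str, working_memory: str, block: MemoryBlock) -> str:
    # Hypothetical reasoner: fold the recalled block into the working memory.
    if not working_memory:
        return block.summary
    return working_memory + " | " + block.summary


def answer_over_memory(
    tokens: List[str],
    question: str,
    compress: Callable[[List[str]], str] = toy_compress,
    gate: Callable[[str, str, MemoryBlock], bool] = toy_gate,
    reason: Callable[[str, str, MemoryBlock], str] = toy_reason,
    chunk_size: int = 4,
) -> Tuple[str, List[int]]:
    # 1) Build the memory bank once: one compressed block per chunk.
    memory_bank = [
        MemoryBlock(i, compress(chunk))
        for i, chunk in enumerate(split_into_chunks(tokens, chunk_size))
    ]
    # 2) Recall selectively and update the working memory step by step.
    working_memory, recalled = "", []
    for block in memory_bank:
        if gate(question, working_memory, block):
            working_memory = reason(question, working_memory, block)
            recalled.append(block.index)
    return working_memory, recalled


if __name__ == "__main__":
    doc = "Alice lives in Paris Bob likes green tea Paris is in France".split()
    print(answer_over_memory(doc, "Which country does Alice live in?"))
```

In this toy run the third chunk is recalled only because the working memory already mentions Paris after the first chunk is processed, which is the kind of multi-hop, state-dependent recall the abstract describes. The second chunk never reaches the reasoner.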

Paper Structure (94 sections, 14 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: LycheeMemory achieves the best performance and the lowest latency. Left: Relative performance of various methods on the Qwen2.5-7B model across different LongBench datasets. Right: Inference time across different context lengths, measured on 128 samples.
  • Figure 2: Overview of the LycheeMemory framework. The left panel illustrates compressed memory construction, where a long document is segmented and compressed into compact KV-cache representations by the compressor. The right panel depicts the dynamic recall and reasoning workflow, in which the gate selectively activates relevant memory blocks and the reasoner iteratively updates the working memory to produce the final answer.
  • Figure 3: Inference latency as context length increases. LycheeMemory exhibits a nearly flat latency curve, in contrast to the quadratic and linear increases observed in the full-context and MemAgent baselines, respectively.
  • Figure 4: QA Accuracy across varying context lengths under different compression ratios. The $4\times$ ratio (Ours) achieves the optimal balance, matching the stability of $2\times$ while significantly outperforming aggressive compression ($16\times$).
  • Figure 5: Training reward curves (raw data). The blue line (Frozen Compressor) converges quickly but hits a performance plateau. The red line (End-to-End) initially exhibits higher variance due to exploration of the compression policy but achieves higher rewards.
  • ...and 2 more figures