RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation

Zhichao Xu, Minheng Wang, Yawei Wang, Wenqian Ye, Yuntao Du, Yunpu Ma, Yijun Tian

TL;DR

RECON tackles inefficiencies in RL-based retrieval-augmented generation by inserting a learned summarization module that condenses evidence after each retrieval. It trains the summarizer in two stages—MS MARCO relevance pretraining and multi-aspect distillation from GPT-4o-mini—and integrates it into the Search-R1 pipeline. The approach reduces context length by $35\%$ and yields faster training ($5.2\%$) and lower inference latency ($30.9\%$), while improving QA accuracy, especially on multi-hop tasks (3B EM up by $14.5\%$, 7B EM by $3.0\%$). These results highlight learned context compression as a practical design for scalable, high-performance RAG systems.

Abstract

Retrieval-augmented generation (RAG) systems trained using reinforcement learning (RL) with reasoning are hampered by inefficient context management, where long, noisy retrieved documents increase costs and degrade performance. We introduce RECON (REasoning with CONdensation), a framework that integrates an explicit summarization module to compress evidence within the reasoning loop. Our summarizer is trained via a two-stage process: relevance pretraining on QA datasets, followed by multi-aspect distillation from proprietary LLMs to ensure factuality and clarity. Integrated into the Search-R1 pipeline, RECON reduces total context length by 35\%, leading to improved training speed and inference latency, while simultaneously improving RAG performance on downstream QA benchmarks. Notably, it boosts the average EM score of the 3B model by 14.5\% and the 7B model by 3.0\%, showing particular strength in multi-hop QA. RECON demonstrates that learned context compression is essential for building practical, scalable, and performant RAG systems. Our code implementation is made available at https://github.com/allfornancy/RECON.
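
The loop described above, in which evidence is condensed after each retrieval before it enters the reasoning context, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `rollout_with_condensation`, the `policy_step`/`retrieve`/`summarize` callables, the `Search:`/`Evidence:` line prefixes, and the toy stand-ins are all hypothetical names chosen for this example.

```python
from typing import Callable, List, Tuple


def rollout_with_condensation(
    question: str,
    policy_step: Callable[[str], Tuple[str, str]],
    retrieve: Callable[[str], List[str]],
    summarize: Callable[[str, List[str]], str],
    max_turns: int = 4,
) -> Tuple[str, List[str]]:
    """Run a search-and-reason loop that stores condensed evidence only.

    policy_step returns ("search", query) or ("answer", final_answer).
    All interfaces here are assumptions made for illustration.
    """
    context = [f"Question: {question}"]
    for _ in range(max_turns):
        action, text = policy_step("\n".join(context))
        if action == "answer":
            return text, context
        docs = retrieve(text)
        # Key step: the raw documents never enter the context;
        # only the summarizer's condensed output does.
        context.append(f"Search: {text}")
        context.append(f"Evidence: {summarize(text, docs)}")
    return "", context  # turn budget exhausted without an answer


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_policy(ctx: str) -> Tuple[str, str]:
    if "Evidence:" in ctx:
        return "answer", "Paris"
    return "search", "capital of France"


def toy_retrieve(query: str) -> List[str]:
    # Long, repetitive documents mimic noisy retrieval results.
    return ["Paris is the capital of France. " * 20,
            "France is in Europe. " * 20]


def toy_summarize(query: str, docs: List[str]) -> str:
    # Stand-in for a learned summarizer: keep one sentence.
    return docs[0].split(". ")[0] + "."


answer, trace = rollout_with_condensation(
    "What is the capital of France?", toy_policy, toy_retrieve, toy_summarize
)
```

In this sketch the context grows by one short evidence line per search turn rather than by the full retrieved documents, which is the mechanism the abstract credits for the shorter context. A real system would replace the toy callables with the policy model, the search engine, and the trained summarizer.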

Paper Structure

This paper contains 24 sections, 4 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Training pipeline of our method. In the rollout module, instead of directly using retrieval results from the search engine (dashed line), an additional summarization model condenses the retrieved information and removes noise from the document sources. This reduces the context length and makes rollouts more efficient and effective in both training and inference. Refer to the paper's algorithmic section for details.
  • Figure 2: Inference efficiency of RECON vs. Search-R1, with Qwen2.5-7B-Base + PPO. We report average context length ($\downarrow$), inference wallclock time per query ($\downarrow$) and number of search turns ($\downarrow$) over 7 datasets.
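
The three quantities reported in Figure 2 could be aggregated from per-query logs along the following lines. This is a sketch under stated assumptions: the `QueryLog` record, the helper names, and every number below are made up for illustration and are not the paper's measurements; how the paper averages across its 7 datasets is not specified here, so a plain per-query mean is assumed.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class QueryLog:
    """One inference run for one query (hypothetical record layout)."""
    context_tokens: int   # total context length at the end of the rollout
    wallclock_s: float    # inference wallclock time for this query
    search_turns: int     # number of search calls issued


def aggregate(logs: List[QueryLog]) -> Dict[str, float]:
    """Average each metric over all queries (lower is better for all three)."""
    return {
        "avg_context_tokens": mean(q.context_tokens for q in logs),
        "avg_wallclock_s": mean(q.wallclock_s for q in logs),
        "avg_search_turns": mean(q.search_turns for q in logs),
    }


def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


# Illustrative, made-up logs; NOT values from the paper.
baseline_logs = [QueryLog(1200, 4.0, 3), QueryLog(800, 2.0, 2)]
condensed_logs = [QueryLog(780, 3.0, 2), QueryLog(520, 1.5, 2)]

base = aggregate(baseline_logs)
ours = aggregate(condensed_logs)
context_cut = relative_reduction(base["avg_context_tokens"],
                                 ours["avg_context_tokens"])
```

Reporting a relative reduction rather than raw token counts makes results comparable across datasets whose questions need different amounts of evidence.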