RAG-R1: Incentivizing the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu

TL;DR

RAG-R1 tackles the brittleness and latency of traditional single-query retrieval in retrieval-augmented reasoning by introducing a two-stage training framework that enables adaptive use of internal and external knowledge and by replacing serial, single-query retrieval with multi-query parallelism. The approach combines Format Learning Supervised Fine-Tuning to instill think-then-search behavior with Retrieval-Augmented Reinforcement Learning to optimize reasoning and retrieval in a reward-driven setting, including a masked loss for retrieved content and a rule-based Exact Match reward. Empirical results across seven open-domain QA benchmarks show state-of-the-art performance, with multi-query parallelism delivering notable gains in accuracy (up to 13.7% over the strongest baseline) and reductions in inference time (up to 11.1%), along with strong generalization to out-of-domain data and online search. Overall, RAG-R1 offers a scalable, robust framework for integrating reasoning and retrieval in LLMs, with practical implications for latency-sensitive applications and dynamic information environments.
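The two training signals named above, a rule-based Exact Match reward and a loss that masks retrieved content, can be illustrated with a minimal sketch. This is not the authors' implementation: the normalization steps (lowercasing, stripping punctuation and English articles) follow a common QA evaluation convention and are an assumption here, and the function names `normalize_answer`, `exact_match_reward`, and `masked_mean_loss` are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace.

    Assumption: a common QA normalization; the summary does not specify one.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0


def masked_mean_loss(token_losses: list[float], is_retrieved: list[bool]) -> float:
    """Average per-token loss over model-generated tokens only.

    Tokens flagged as retrieved content are excluded, so the model is not
    trained to reproduce text it did not generate.
    """
    kept = [loss for loss, retrieved in zip(token_losses, is_retrieved) if not retrieved]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    print(exact_match_reward("The Eiffel Tower.", ["Eiffel Tower"]))
    print(masked_mean_loss([1.0, 5.0, 3.0], [False, True, False]))
```

The reward is binary by construction, which keeps it cheap to compute and hard to game; the mask simply drops retrieved tokens from the average rather than reweighting them.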

Abstract

Large Language Models (LLMs), despite their remarkable capabilities, are prone to generating hallucinated or outdated content due to their static internal knowledge. While Retrieval-Augmented Generation (RAG) integrated with Reinforcement Learning (RL) offers a solution, these methods are fundamentally constrained by a single-query mode, leading to prohibitive latency and inherent brittleness. To overcome these limitations, we introduce RAG-R1, a novel two-stage training framework centered around multi-query parallelism. Our framework enables LLMs to adaptively leverage internal and external knowledge during the reasoning process while transitioning from the single-query mode to multi-query parallelism. This architectural shift bolsters reasoning robustness while significantly reducing inference latency. Extensive experiments on seven question-answering benchmarks confirm the superiority of our method, which outperforms the strongest baseline by up to 13.7% and decreases inference time by 11.1%.
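The shift from single-query mode to multi-query parallelism can be pictured as issuing several search queries in one retrieval round instead of one query per round. The sketch below is only an illustration under stated assumptions: `toy_retriever` is a hypothetical in-memory stand-in for a real search backend, `parallel_search` is a hypothetical helper, and the thread pool is one possible way to dispatch queries concurrently, not necessarily the mechanism used in the paper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def toy_retriever(query: str) -> list[str]:
    """Hypothetical retriever: keyword lookup over a tiny in-memory corpus."""
    corpus = {
        "inception": "Inception was directed by Christopher Nolan.",
        "nolan": "Christopher Nolan was born in 1970.",
    }
    lowered = query.lower()
    return [doc for key, doc in corpus.items() if key in lowered]


def parallel_search(
    queries: list[str],
    retriever: Callable[[str], list[str]],
    max_workers: int = 4,
) -> dict[str, list[str]]:
    """Run all queries of one retrieval round concurrently.

    Results are returned keyed by query, in the order the queries were given.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(retriever, queries))
    return dict(zip(queries, results))


if __name__ == "__main__":
    round_queries = ["Who directed Inception?", "When was Christopher Nolan born?"]
    for query, docs in parallel_search(round_queries, toy_retriever).items():
        print(query, "->", docs)
```

With a real backend, the wall-clock cost of one round is roughly that of the slowest query rather than the sum of all queries, which is the intuition behind needing fewer retrieval iterations.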

Paper Structure

This paper contains 22 sections, 2 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Comparison of single-query and multi-query methods on Multi-Hop QA benchmarks based on Qwen2.5-72B-Instruct: (a) Model performance evaluated by Exact Match metric; (b) Average retrieval iterations during inference. The multi-query approach achieves a higher Exact Match score with fewer retrieval iterations.
  • Figure 2: Overview of the RAG-R1 training framework, consisting of two stages: Format Learning Supervised Fine-Tuning (Section \ref{sec:SFT}) and Retrieval-Augmented Reinforcement Learning (Section \ref{sec:rl}), along with the data generation details.
  • Figure 3: Performance comparison of RAG-R1 and three baselines within the online search scenario. Our models consistently deliver robust results across both offline and online settings, highlighting the strong generalization capabilities of our approach.