RAG-R1: Incentivizing the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu
TL;DR
RAG-R1 targets the brittleness and latency of single-query retrieval in retrieval-augmented reasoning. It introduces a two-stage training framework that lets the model draw adaptively on internal and external knowledge, and it replaces serial single-query retrieval with multi-query parallelism. The first stage, Format Learning Supervised Fine-Tuning, instills think-then-search behavior. The second stage, Retrieval-Augmented Reinforcement Learning, optimizes reasoning and retrieval in a reward-driven setting, using a rule-based Exact Match reward and a masked loss for retrieved content. Across seven open-domain QA benchmarks, RAG-R1 reports state-of-the-art performance: multi-query parallelism improves accuracy by up to 13.7% over the strongest baseline and reduces inference time by up to 11.1%, with strong generalization to out-of-domain data and online search. The result is a scalable, robust framework for integrating reasoning and retrieval in LLMs, with practical implications for latency-sensitive applications and dynamic information environments.
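To make the two training signals named above more concrete, here is a minimal sketch of a rule-based Exact Match reward and of a loss mask that excludes retrieved content. It is illustrative only and is not the authors' implementation: the normalization rules (lowercasing, stripping punctuation and English articles), the `<information>` / `</information>` delimiters, and all function names are assumptions introduced for this example.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this common QA-style normalization is used for illustration;
    the source does not spell out its normalization rules.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0


def retrieval_loss_mask(
    tokens: list[str],
    open_tag: str = "<information>",    # hypothetical delimiter
    close_tag: str = "</information>",  # hypothetical delimiter
) -> list[int]:
    """Return 1 for tokens that count toward the loss, 0 for retrieved content.

    The delimiter tokens themselves are also masked out in this sketch.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(0)
        elif tok == close_tag:
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask
```

In a training loop, a mask like this would typically be multiplied into the per-token loss so that gradients flow only through tokens the model generated itself, not through passages returned by the retriever.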
Abstract
Large Language Models (LLMs), despite their remarkable capabilities, are prone to generating hallucinated or outdated content due to their static internal knowledge. While Retrieval-Augmented Generation (RAG) integrated with Reinforcement Learning (RL) offers a solution, these methods are fundamentally constrained by a single-query mode, leading to prohibitive latency and inherent brittleness. To overcome these limitations, we introduce RAG-R1, a novel two-stage training framework centered around multi-query parallelism. Our framework enables LLMs to adaptively leverage internal and external knowledge during the reasoning process while transitioning from the single-query mode to multi-query parallelism. This architectural shift bolsters reasoning robustness while significantly reducing inference latency. Extensive experiments on seven question-answering benchmarks confirm the superiority of our method, which outperforms the strongest baseline by up to 13.7% and decreases inference time by 11.1%.
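The move from single-query mode to multi-query parallelism can be pictured with a small sketch: instead of issuing one query, waiting, and then issuing the next, the model emits several queries in one step and they are dispatched concurrently. The code below is a toy illustration under stated assumptions, not the paper's system: `search`, `parallel_search`, and the in-memory `CORPUS` are hypothetical stand-ins for a real retriever.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus standing in for a real retriever index (hypothetical data).
CORPUS = {
    "query a": "passage about A",
    "query b": "passage about B",
}


def search(query: str) -> str:
    """Hypothetical retriever: look up one query in the toy corpus."""
    return CORPUS.get(query.lower().strip(), "")


def parallel_search(queries: list[str], max_workers: int = 4) -> dict[str, str]:
    """Issue all queries concurrently and return results keyed by query."""
    if not queries:
        return {}
    workers = min(max_workers, len(queries))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with queries.
        results = list(pool.map(search, queries))
    return dict(zip(queries, results))
```

With a real, I/O-bound retriever, the wall-clock cost of one such round is roughly that of the slowest query rather than the sum of all queries, which is the intuition behind the latency reduction described in the abstract.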
