R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning

Yuan Li, Qi Luo, Xiaonan Li, Bufan Li, Qinyuan Cheng, Bo Wang, Yining Zheng, Yuxin Wang, Zhangyue Yin, Xipeng Qiu

TL;DR

R3-RAG addresses the retriever bottleneck in retrieval-augmented generation (RAG) by training LLMs to interleave step-by-step reasoning and retrieval through reinforcement learning. It first cold-starts on supervised trajectories to bootstrap iterative reasoning and retrieval, then applies PPO-based RL with both outcome and process rewards to steer trajectories toward correct answers and relevant documents. Results on HotpotQA, 2WikiMultiHopQA, and MuSiQue show clear improvements over strong baselines and good transfer across retrievers, with efficiency gains from more targeted retrieval interactions. The work advances practical, scalable retrieval-augmented reasoning for multi-hop questions and highlights the value of fine-grained process signals in guiding retrieval.

Abstract

Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems because they have far fewer parameters than LLMs and cannot perform step-by-step reasoning. Prompt-based iterative RAG attempts to address these limitations, but it is constrained by human-designed workflows. We therefore propose R3-RAG, which uses Reinforcement learning to make the LLM learn how to Reason and Retrieve step by step, thus retrieving comprehensive external knowledge and reaching correct answers. R3-RAG has two stages. We first use a cold start to teach the model to iteratively interleave reasoning and retrieval. We then use reinforcement learning to strengthen its ability to explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness as the outcome reward, which judges whether the trajectory leads to a correct answer; and 2) relevance-based document verification as the process reward, which encourages the model to retrieve documents relevant to the user question. Together, these rewards let the model learn how to iteratively reason and retrieve relevant documents to reach the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and transfers well to different retrievers. We release R3-RAG at https://github.com/Yuan-Li-FNLP/R3-RAG.
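
To make the two reward signals concrete, the following is a minimal sketch of how an outcome reward and a per-step process reward might be combined into one trajectory score. It is not the authors' implementation: the `Step` structure, the exact-match correctness check, the `doc_relevance` score in [0, 1], and the `process_weight` mixing coefficient are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning-and-retrieval step (hypothetical structure)."""
    query: str
    doc_relevance: float  # assumed relevance score in [0, 1] from some verifier


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparing answers.
    return " ".join(text.lower().split())


def outcome_reward(predicted: str, gold: str) -> float:
    """Outcome reward: 1.0 if the final answer is judged correct, else 0.0.

    Normalized exact match is an assumption; the paper's judge may differ.
    """
    return 1.0 if _normalize(predicted) == _normalize(gold) else 0.0


def process_rewards(steps: List[Step]) -> List[float]:
    """Process reward: one relevance-based score per retrieval step."""
    return [step.doc_relevance for step in steps]


def trajectory_reward(
    steps: List[Step], predicted: str, gold: str, process_weight: float = 0.5
) -> float:
    """Combine both signals; the weighted-mean form is an illustrative choice."""
    per_step = process_rewards(steps)
    mean_process = sum(per_step) / len(per_step) if per_step else 0.0
    return outcome_reward(predicted, gold) + process_weight * mean_process
```

In a PPO-style setup the per-step scores would more likely be assigned at each step rather than averaged; the single scalar here only illustrates that a trajectory can be rewarded for relevant retrieval even before the final answer is checked.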

Paper Structure

This paper contains 38 sections, 9 equations, 7 figures, 5 tables, 2 algorithms.

Figures (7)

  • Figure 1: Comparison of different RAG approaches: (a) Vanilla RAG: the LLM uses the documents retrieved for the original question to generate the response; (b) Iterative RAG: the LLM interleaves thinking and invoking the retriever in a fixed, human-designed workflow; and (c) R3-RAG: uses reinforcement learning (RL) to enable the LLM to better reason and retrieve iteratively, to get relevant documents and produce the correct answer.
  • Figure 2: Training Pipeline and Reward Design for R3-RAG
  • Figure 3: Impact of the maximum number of reasoning steps on HotpotQA and 2WikiMultiHopQA. Results are shown for both Qwen and Llama backbone models. All models are trained with up to five reasoning steps.
  • Figure 4: R3-RAG performance across different retrieval top-$k$ values.
  • Figure 5: Case Study: R3-RAG can reason and retrieve step-by-step and adaptively modify the query when retrieval fails.
  • ...and 2 more figures
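
The iterative behavior described in the captions of Figures 1(c), 3, and 4 can be sketched as a simple loop with a step budget and a retrieval top-k. This is a hedged illustration, not the released code: the `Policy` and `Retriever` callables, the `"retrieve"`/`"answer"` action labels, the default `top_k=3`, and returning an empty answer when the budget runs out are all assumptions. Only the limit of five steps comes from the Figure 3 caption.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: a retriever maps (query, k) to a list of documents;
# a policy maps (question, context) to ("retrieve", query) or ("answer", text).
Retriever = Callable[[str, int], List[str]]
Policy = Callable[[str, List[str]], Tuple[str, str]]


def run_episode(
    question: str,
    policy: Policy,
    retriever: Retriever,
    max_steps: int = 5,  # Figure 3: models are trained with up to five steps
    top_k: int = 3,      # assumed default; Figure 4 varies this value
) -> Tuple[str, List[str]]:
    """Interleave reasoning and retrieval until the policy answers."""
    context: List[str] = []
    for _ in range(max_steps):
        action, text = policy(question, context)
        if action == "answer":
            return text, context
        # Otherwise treat `text` as a search query and add the results.
        context.extend(retriever(text, top_k))
    # Step budget exhausted without an answer (assumed fallback).
    return "", context
```

Because the policy sees the accumulated context at every step, it can issue a reworded query when an earlier retrieval returns nothing useful, which is the behavior the Figure 5 case study describes.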