MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval

Qi Luo, Xiaonan Li, Yuxin Wang, Tingshuo Fan, Yuan Li, Xinchi Chen, Xipeng Qiu

TL;DR

The paper tackles the bottleneck in retrieval-augmented generation where a single retriever constrains access to external information. It proposes MARAG-R1, a multi-tool agentic RAG framework that uses four retrieval tools (semantic search, keyword search, filtering, aggregation) and a two-stage training pipeline (supervised fine-tuning followed by reinforcement learning) to dynamically orchestrate information gathering. Through a carefully designed reward structure and Leave-One-Out policy optimization, MARAG-R1 learns effective tool sequencing and reasoning for corpus-level synthesis, achieving state-of-the-art performance on GlobalQA and multi-hop QA benchmarks. The results demonstrate that explicit tool coordination and process-level supervision enable robust corpus-wide reasoning and generalize well to unseen multi-hop tasks, with scalable gains as model size increases.
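The summary above names Leave-One-Out policy optimization without giving its formula. As a rough illustration only, a common form of leave-one-out baseline scores each sampled rollout against the mean reward of the other rollouts drawn for the same query. The function name, the reward values, and the assumption that the paper uses exactly this form are all hypothetical.

```python
def leave_one_out_advantages(rewards):
    """Advantage of rollout i = its reward minus the mean reward of the other rollouts.

    Assumption: `rewards` holds one scalar reward per rollout sampled for the
    same query. This is a generic leave-one-out baseline, not necessarily the
    exact objective used in the paper.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two sampled rollouts")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean of all rewards except r itself.
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for four rollouts of one query.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = leave_one_out_advantages(rewards)
```

Because each baseline excludes the rollout it is applied to, above-average rollouts receive positive advantages and below-average ones negative, and the advantages for a query sum to zero.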

Abstract

Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge. However, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools -- semantic search, keyword search, filtering, and aggregation -- and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.
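To make the four tool types concrete, the following is a minimal sketch over a toy in-memory corpus. Everything in it is an assumption made for illustration: the function names, their signatures, the document fields, and the use of bag-of-words cosine similarity as a stand-in for a real dense retriever. It does not reproduce the paper's tool interface.

```python
import math
from collections import Counter

# Toy corpus; the "text" and "year" fields are hypothetical.
CORPUS = [
    {"id": 1, "text": "Alice published a paper on retrieval in 2021", "year": 2021},
    {"id": 2, "text": "Bob published a paper on reinforcement learning in 2023", "year": 2023},
    {"id": 3, "text": "Carol wrote a survey on retrieval augmented generation in 2023", "year": 2023},
]


def _bow(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_search(query, docs, k=2):
    """Stand-in for dense retrieval: rank documents by similarity and keep the top k."""
    q = _bow(query)
    return sorted(docs, key=lambda d: _cosine(q, _bow(d["text"])), reverse=True)[:k]


def keyword_search(keyword, docs):
    """Return every document whose text contains the keyword (case-insensitive)."""
    return [d for d in docs if keyword.lower() in d["text"].lower()]


def filter_docs(docs, field, value):
    """Keep only documents whose metadata field equals the given value."""
    return [d for d in docs if d.get(field) == value]


def aggregate(docs, op="count"):
    """Reduce a document set to a single statistic; only counting is sketched here."""
    if op == "count":
        return len(docs)
    raise ValueError(f"unsupported aggregation: {op}")


# Chaining tools: "How many 2023 documents mention retrieval?"
hits = keyword_search("retrieval", CORPUS)
hits_2023 = filter_docs(hits, "year", 2023)
answer = aggregate(hits_2023)

# A fixed top-k semantic search, by contrast, returns only a bounded slice of the corpus.
top = semantic_search("retrieval augmented generation survey", CORPUS, k=1)
```

The chained call illustrates why keyword search, filtering, and aggregation can reach answers that depend on the whole corpus, whereas a single top-k retrieval sees at most k documents regardless of how many are relevant.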

Paper Structure

This paper contains 38 sections, 11 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Overview of the MARAG-R1 framework. (a) The standard RAG model performs a single round of retrieval from a fixed top-$k$ document set, which often limits knowledge coverage. (b) Graph-based RAG models structured semantic relations among documents, which enhances global awareness but loses the original document-level information during graph construction. (c) In contrast, our MARAG-R1 framework dynamically coordinates multiple specialized retrieval tools to access and integrate diverse external information, achieving more comprehensive and factual reasoning.
  • Figure 2: Overview of the MARAG-R1 framework.
  • Figure 3: Training dynamics during RL optimization. Subfigures show (a) steady decrease in training perplexity indicating improved policy coherence, (b) rapid initial reward growth stabilizing after 40 steps as the policy converges, and (c-d) consistent improvements in both test F1 and D-F1@20 scores validating effective generalization.
  • Figure 4: F1/D-F1@20 performance of MARAG-R1 and ReCall under different retrieval steps.