MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval
Qi Luo, Xiaonan Li, Yuxin Wang, Tingshuo Fan, Yuan Li, Xinchi Chen, Xipeng Qiu
TL;DR
The paper tackles the bottleneck in retrieval-augmented generation where a single retriever constrains access to external information. It proposes MARAG-R1, a multi-tool agentic RAG framework that uses four retrieval tools (semantic search, keyword search, filtering, aggregation) and a two-stage training pipeline (supervised fine-tuning followed by reinforcement learning) to dynamically orchestrate information gathering. Through a carefully designed reward structure and Leave-One-Out policy optimization, MARAG-R1 learns effective tool sequencing and reasoning for corpus-level synthesis, achieving state-of-the-art performance on GlobalQA and multi-hop QA benchmarks. The results demonstrate that explicit tool coordination and process-level supervision enable robust corpus-wide reasoning and generalize well to unseen multi-hop tasks, with scalable gains as model size increases.
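As a rough illustration of the leave-one-out idea mentioned above, the sketch below computes a per-sample advantage by subtracting the mean reward of the other sampled rollouts for the same question. This is the generic leave-one-out baseline, not necessarily the exact objective used in MARAG-R1; the function name and the reward values are hypothetical.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each rollout relative to the mean reward of the others.

    Generic leave-one-out baseline; an assumption for illustration,
    not the paper's exact formulation.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two rollouts")
    total = sum(rewards)
    # Baseline for rollout i is the average reward of the other k - 1 rollouts.
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for four rollouts sampled for the same question.
advantages = leave_one_out_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts that score above their peers receive a positive advantage and those below receive a negative one, so no separately learned value model is needed for the baseline.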
Abstract
Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; however, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools -- semantic search, keyword search, filtering, and aggregation -- and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.
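To make the four tool types concrete, here is a minimal sketch over a toy in-memory corpus. Everything in it is an assumption for illustration: the corpus, field names, and function names are hypothetical, the "semantic" search is a bag-of-words cosine stand-in rather than a dense retriever, and the tool sequence is hard-coded, whereas MARAG-R1 learns which tool to call and when.

```python
import math
import re
from collections import Counter

# Toy corpus; documents and fields are hypothetical, not from the paper.
CORPUS = [
    {"id": 1, "text": "Alice published a paper on retrieval in 2021", "year": 2021},
    {"id": 2, "text": "Bob published a paper on reinforcement learning in 2023", "year": 2023},
    {"id": 3, "text": "Alice gave a talk on agentic retrieval in 2023", "year": 2023},
]


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def semantic_search(query, docs, k=2):
    """Stand-in for a dense retriever: rank by bag-of-words cosine similarity."""
    q = Counter(tokens(query))

    def score(doc):
        d = Counter(tokens(doc["text"]))
        dot = sum(q[t] * d[t] for t in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
            sum(v * v for v in d.values())
        )
        return dot / norm if norm else 0.0

    return sorted(docs, key=score, reverse=True)[:k]


def keyword_search(keyword, docs):
    """Return every document containing the keyword as an exact token."""
    kw = keyword.lower()
    return [d for d in docs if kw in tokens(d["text"])]


def filter_docs(docs, field, value):
    """Keep documents whose metadata field equals the given value."""
    return [d for d in docs if d.get(field) == value]


def aggregate(docs, op="count"):
    """Reduce a document set to a single answer-ready value."""
    if op == "count":
        return len(docs)
    if op == "ids":
        return sorted(d["id"] for d in docs)
    raise ValueError(f"unsupported aggregation: {op}")


# Hard-coded plan for "How many 2023 documents mention Alice?"
hits = keyword_search("Alice", CORPUS)
hits_2023 = filter_docs(hits, "year", 2023)
answer = aggregate(hits_2023, "count")
```

The point of the example is that keyword search, filtering, and aggregation operate on the full matching set rather than a fixed top-k slice, which is what a corpus-level question such as a count requires.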
