MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Thang Nguyen, Peter Chin, Yu-Wing Tai

TL;DR

This paper addresses the challenges of retrieval-augmented generation (RAG) in complex, ambiguous, and multi-hop information-seeking tasks by introducing MA-RAG, a training-free, modular multi-agent framework. MA-RAG delegates distinct subtasks to four specialized agents (Planner, Step Definer, Extractor, and QA) that collaborate via chain-of-thought prompting to perform stepwise, interpretable reasoning and on-demand retrieval. The approach yields strong open-domain QA performance, generalizes to medical-domain QA without domain-specific fine-tuning, and is supported by ablations showing that planning and extraction are critical for multi-hop reasoning. By enabling fine-grained control over information flow through modular reasoning, MA-RAG offers a scalable, interpretable path to improved grounding and reliability in retrieval-augmented systems.

Abstract

We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, each responsible for a distinct stage of the RAG pipeline. By decomposing tasks into subtasks such as query disambiguation, evidence extraction, and answer synthesis, and enabling agents to communicate intermediate reasoning via chain-of-thought prompting, MA-RAG progressively refines retrieval and synthesis while maintaining modular interpretability. Extensive experiments on multi-hop and ambiguous QA benchmarks, including NQ, HotpotQA, 2WikimQA, and TriviaQA, demonstrate that MA-RAG significantly outperforms standalone LLMs and existing RAG methods across all model scales. Notably, even a small LLaMA3-8B model equipped with MA-RAG surpasses larger standalone LLMs, while larger variants (LLaMA3-70B and GPT-4o-mini) set new state-of-the-art results on challenging multi-hop datasets. Ablation studies reveal that both the planner and extractor agents are critical for multi-hop reasoning, and that high-capacity models are especially important for the QA agent to synthesize answers effectively. Beyond general-domain QA, MA-RAG generalizes to specialized domains such as medical QA, achieving competitive performance against domain-specific models without any domain-specific fine-tuning. Our results highlight the effectiveness of collaborative, modular reasoning in retrieval-augmented systems: MA-RAG not only improves answer accuracy and robustness but also provides interpretable intermediate reasoning steps, establishing a new paradigm for efficient and reliable multi-agent RAG.

Paper Structure

This paper contains 27 sections, 1 equation, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Architectural Comparison of MA-RAG and Prior RAG Methods. a) A naive RAG system performs one-shot retrieval followed by direct answer generation. b) Enhanced systems incorporate post-retrieval processing such as document re-ranking or summarization. c) Iterative systems interleave retrieval and reasoning via query rewriting or multi-step refinement, yet often lack explicit modularity and planning. d) In contrast, MA-RAG adopts a collaborative multi-agent architecture where specialized agents handle distinct stages of the RAG pipeline, such as query disambiguation, targeted evidence extraction, and answer synthesis, using chain-of-thought reasoning. Agents are invoked dynamically and on demand, enabling fine-grained document analysis and step-by-step resolution of ambiguities, resulting in a more robust, interpretable, and efficient retrieval-to-generation process.
  • Figure 2: Overview of MA-RAG. MA-RAG is a training-free, multi-agent RAG framework that decomposes complex queries into interpretable steps through collaborative reasoning. The left panel shows individual components and their I/O interfaces; the right panel illustrates the overall iterative workflow. A Planner Agent first breaks down the input query into a high-level reasoning plan. For each step, a Step Definer Agent generates a detailed subquery based on the step goal, original question, and prior outputs. This subquery is processed by the Retrieval Tool to fetch top-ranked documents, which are then refined by the Extractor Agent to retain only step-relevant content. The QA Agent synthesizes the final answer for each step using the filtered evidence and subquery. MA-RAG iterates through these steps until the full reasoning path is complete.
  • Figure 3: Exact Match (EM) performance of MA-RAG and baseline methods on NQ, HotpotQA, and 2WikimQA. The green star indicates MA-RAG with LLaMA3-8B, the blue star indicates MA-RAG with LLaMA3-70B, and the red star indicates MA-RAG with GPT-4o-mini. Across all datasets, MA-RAG consistently outperforms baseline methods using the same model size, demonstrating the effectiveness of our multi-agent reasoning approach.
  • Figure 4: MA-RAG graph representations in LangChain.
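
The workflow described in the Figure 2 caption (plan, define a subquery, retrieve, extract, answer, repeat) can be sketched as a plain Python loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_ma_rag`, `StepTrace`, and all `toy_*` functions are hypothetical names, and the rule-based stubs merely stand in for what the paper describes as prompted LLM agents and a retrieval tool.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (subquery, answer) pairs from earlier steps


@dataclass
class StepTrace:
    """Record of one reasoning step, kept so the path stays inspectable."""
    step: str
    subquery: str
    evidence: List[str]
    answer: str


def run_ma_rag(
    question: str,
    planner: Callable[[str], List[str]],
    step_definer: Callable[[str, str, History], str],
    retriever: Callable[[str, int], List[str]],
    extractor: Callable[[str, List[str]], List[str]],
    qa_agent: Callable[[str, List[str]], str],
    top_k: int = 2,
) -> List[StepTrace]:
    """Iterate over the plan; each step sees the outputs of prior steps."""
    history: History = []
    traces: List[StepTrace] = []
    for step in planner(question):
        subquery = step_definer(question, step, history)  # detailed subquery
        docs = retriever(subquery, top_k)                 # top-ranked documents
        evidence = extractor(subquery, docs)              # keep step-relevant text
        answer = qa_agent(subquery, evidence)             # per-step answer
        history.append((subquery, answer))
        traces.append(StepTrace(step, subquery, evidence, answer))
    return traces


# ---- Toy stand-ins (hypothetical; real agents would be LLM calls) ----
CORPUS = [
    "Inception was directed by Christopher Nolan.",
    "Christopher Nolan was born in London.",
    "Paris is the capital of France.",
]


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def toy_planner(question: str) -> List[str]:
    return ["Identify the director of Inception",
            "Find the birthplace of that director"]


def toy_step_definer(question: str, step: str, history: History) -> str:
    if not history:
        return "Who directed Inception?"
    # Reuse the previous step's answer to build the next subquery.
    return f"Where was {history[-1][1]} born?"


def toy_retriever(subquery: str, k: int) -> List[str]:
    # Rank documents by word overlap with the subquery.
    query = _tokens(subquery)
    ranked = sorted(CORPUS, key=lambda d: len(query & _tokens(d)), reverse=True)
    return ranked[:k]


def toy_extractor(subquery: str, docs: List[str]) -> List[str]:
    # Drop documents that share no words with the subquery.
    query = _tokens(subquery)
    return [d for d in docs if query & _tokens(d)]


def toy_qa(subquery: str, evidence: List[str]) -> str:
    pattern = r"directed by (.+)\." if subquery.startswith("Who") else r"born in (.+)\."
    for passage in evidence:
        match = re.search(pattern, passage)
        if match:
            return match.group(1)
    return "unknown"


traces = run_ma_rag(
    "Where was the director of Inception born?",
    toy_planner, toy_step_definer, toy_retriever, toy_extractor, toy_qa,
)
for t in traces:
    print(f"{t.subquery} -> {t.answer}")
```

In this toy run the second subquery is built from the first step's answer, which is the dependency between steps that the caption attributes to the Step Definer Agent. Keeping a `StepTrace` per step is one simple way to expose the intermediate reasoning. The paper's own graph-based implementation (Figure 4) may be organized differently.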