Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning

Yihong Wu, Liheng Ma, Muzhi Li, Jiaming Zhou, Lei Ding, Jianye Hao, Ho-fung Leung, Irwin King, Yingxue Zhang, Jian-Yun Nie

TL;DR

The paper tackles the long-context bottleneck in open-domain multi-hop QA by proposing Mujica, a planner–worker multi-agent RAG framework that decomposes complex questions via a directed acyclic graph (DAG) of subquestions. It further introduces MyGO, a minimalist post-training method that uses sampling from an asymptotically optimal distribution and maximum likelihood updates, avoiding value networks and reward shaping. The authors provide theoretical convergence guarantees for MyGO and demonstrate strong empirical gains across text and knowledge-graph QA benchmarks, including MuSiQue, 2Wiki, and Hotpot, with improvements in both accuracy and training efficiency. Overall, Mujica-MyGO reduces context-length demands while delivering robust reasoning and generalization, offering a scalable approach for real-world RAG systems.
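
The planner–worker decomposition described above can be illustrated with a small sketch. It is not the authors' implementation: the `worker` stub, the `TOY_KB` lookup table, the `{q1}` placeholder syntax, and the example question are all hypothetical stand-ins for a retriever and an LLM. The sketch only shows the idea that a planner emits a DAG of subquestions and each worker call sees one short, self-contained subquestion rather than the whole interaction history.

```python
from graphlib import TopologicalSorter

# Hypothetical toy knowledge base standing in for a retriever plus worker LLM.
TOY_KB = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}


def worker(subquestion: str) -> str:
    """Answer one self-contained subquestion (stub for retrieve-then-read)."""
    return TOY_KB.get(subquestion, "unknown")


def run_plan(plan: dict[str, tuple[str, list[str]]]) -> dict[str, str]:
    """Execute a DAG of subquestions in dependency order.

    `plan` maps a node id to (question template, list of parent node ids).
    A template may reference a parent's answer as {parent_id}.
    """
    # TopologicalSorter takes a mapping from node to its predecessors.
    sorter = TopologicalSorter({nid: parents for nid, (_, parents) in plan.items()})
    answers: dict[str, str] = {}
    for nid in sorter.static_order():
        template, parents = plan[nid]
        # Only the parents' answers are passed down, not their full contexts.
        question = template.format(**{p: answers[p] for p in parents})
        answers[nid] = worker(question)
    return answers


# Hypothetical plan a planner might emit for
# "Where was the director of Inception born?"
plan = {
    "q1": ("Who directed Inception?", []),
    "q2": ("Where was {q1} born?", ["q1"]),
}
answers = run_plan(plan)
print(answers["q2"])
```

Because each node receives only the answers of its parents, the prompt length per worker call stays bounded by the subquestion and its retrieved evidence, which is the property the summary attributes to the DAG formulation.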

Abstract

Large Language Models (LLMs) equipped with modern Retrieval-Augmented Generation (RAG) systems often employ multi-turn interaction pipelines to interface with search engines for complex reasoning tasks. However, such multi-turn interactions inevitably produce long intermediate contexts, as context length grows exponentially with exploration depth. This leads to a well-known limitation of LLMs: their difficulty in effectively leveraging information from long contexts. This problem is further amplified in RAG systems that depend on in-context learning, where few-shot demonstrations must also be included in the prompt, compounding the context-length bottleneck. To address these challenges, we propose Mujica-MyGO, a unified framework for efficient multi-turn reasoning in RAG. Inspired by the divide-and-conquer principle, we introduce Mujica (Multi-hop Joint Intelligence for Complex Question Answering), a multi-agent RAG workflow that decomposes multi-turn interactions into cooperative sub-interactions, thereby mitigating long-context issues. To eliminate the dependency on in-context learning, we further develop MyGO (Minimalist Policy Gradient Optimization), a lightweight and efficient reinforcement learning algorithm that enables effective post-training of LLMs within complex RAG pipelines. We provide theoretical guarantees for MyGO's convergence to the optimal policy. Empirical evaluations across diverse question-answering benchmarks, covering both text corpora and knowledge graphs, show that Mujica-MyGO achieves superior performance.
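
To make the idea of post-training without a value network more concrete, here is a minimal sketch on a toy problem. The exact MyGO objective and sampling distribution are not stated in this summary, so the code swaps in a simpler, well-known technique: reward-filtered maximum likelihood on a categorical policy. The `train` function, its hyperparameters, and the reward vector are assumptions made for illustration only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(rewards, steps=200, batch=16, lr=0.5, seed=0):
    """Reward-filtered maximum likelihood on a toy categorical policy.

    Assumption: this is NOT the paper's MyGO update. It only illustrates
    the pattern "sample, keep rewarded samples, maximize their likelihood",
    with no value network and no shaped reward.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a batch of actions from the current policy.
        samples = rng.choices(range(len(rewards)), weights=probs, k=batch)
        # Keep only samples with positive reward.
        kept = [a for a in samples if rewards[a] > 0]
        if not kept:
            continue
        # Gradient of the mean log-likelihood of kept samples w.r.t. logit i
        # is (empirical frequency of i among kept samples) - probs[i].
        for i in range(len(logits)):
            freq = kept.count(i) / len(kept)
            logits[i] += lr * (freq - probs[i])
    return softmax(logits)


# Toy setup: four candidate answers, only index 2 is correct.
probs = train(rewards=[0.0, 0.0, 1.0, 0.0])
print(probs)
```

The update uses only a binary outcome reward and a log-likelihood gradient, so the training loop needs no critic and no baseline estimate. In this toy setting the policy concentrates its mass on the rewarded answer.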

Paper Structure

This paper contains 52 sections, 4 theorems, 10 equations, 5 figures, 12 tables.

Key Result

Proposition 3.0

Given $\delta \in \mathbb{R}$, $\alpha \in \mathbb{R}$, there exists $K \in \mathbb{R}^+$ such that the following inequality holds:

Figures (5)

  • Figure 1: The end-to-end architecture of the proposed Mujica framework.
  • Figure 2: Modeling Complex Question Answering Process as a Directed Acyclic Graph.
  • Figure 3: The Proposed Minimalist Policy Gradient Optimization Framework.
  • Figure 4: F1 score on 2Wiki-KG training.
  • Figure 5: F1 score on 2Wiki-Text training.

Theorems & Definitions (4)

  • Proposition 3.0
  • Proposition 3.0
  • Proposition B.0
  • Proposition B.0