Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu

TL;DR

This work tackles the challenge of training vertical multi-agent systems where each agent can employ a distinct LLM, leading to asynchronous rollouts and fragmented gradient flow. It introduces M-GRPO, a hierarchical extension of Group Relative Policy Optimization that computes group-relative advantages for a main planner and multiple sub-agents, paired with a trajectory-alignment scheme and a decoupled, multi-server training pipeline. The approach enables fixed-size batches despite variable sub-agent invocations and decouples optimization across servers, yielding improved stability and sample efficiency on real-world benchmarks GAIA, XBench-DeepSearch, and WebWalkerQA. Results show that co-training both main and sub-agents outperforms single-agent baselines and fixed-subagent configurations, illustrating the value of role specialization and cross-agent coordination for tool-augmented reasoning tasks.

Abstract

Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a single, unified large language model (LLM) for all agents in the system. This may limit performance because different agents face different underlying data distributions. Training multi-agent systems with distinct LLMs is therefore the natural next step. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents improves performance on tool-augmented reasoning tasks.
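
To make the phrase "group-relative advantages" concrete, the following is a minimal sketch of the standard GRPO normalization: each rollout's reward is compared against the mean and standard deviation of its own group of rollouts for the same query. The function name, the `eps` constant, and the example reward values are illustrative assumptions; the abstract does not specify how M-GRPO constructs the sub-agent rewards, so the sketch only shows the normalization applied to a group of main-agent rewards and to a separate group of sub-agent rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (hypothetical helper).

    Standard GRPO-style normalization: subtract the group mean and
    divide by the group standard deviation. `eps` is an assumed
    constant that avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for K = 4 rollouts of the same query (made-up values).
main_rewards = [1.0, 0.0, 0.0, 1.0]
main_adv = group_relative_advantages(main_rewards)

# Assumption: sub-agent trajectories form their own group and are
# normalized separately from the main-agent group.
sub_rewards = [0.5, 0.5, 0.5, 0.5]
sub_adv = group_relative_advantages(sub_rewards)
```

Rollouts that score above their group's mean receive a positive advantage and those below receive a negative one; when every rollout in a group has the same reward, all advantages are zero and the group contributes no learning signal.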

Paper Structure

This paper contains 24 sections, 12 equations, 8 figures.

Figures (8)

  • Figure 1: System workflow with coordinated main and sub-agents. A user query is fed to the main agent $\mathcal{M}$, which plans, reasons, and delegates subtasks to specialized sub-agents $\{\mathcal{S}_i\}$ if needed (e.g., visit/browsing and search tools). Sub-agents return structured feedback to $\mathcal{M}$, which integrates evidence, performs verification, and produces the final answer. Both $\mathcal{M}$ and $\mathcal{S}_i$ may iterate via self-verification loops before producing their outputs.
  • Figure 2: One rollout with nested $\mathcal{M}\!\to\!\mathcal{S}$ interactions. The main agent $\mathcal{M}$ follows trajectory $\tau_{\mathcal{M}}$ (red) and may distribute subtasks by invoking the sub-agent $\mathcal{S}$ multiple times (e.g., $a_{\mathcal{M}}^{1}$ and $a_{\mathcal{M}}^{3}$). Each invocation generates a sub-trajectory $\tau_{\mathcal{S}_i}$ (blue) that performs tool-use steps and returns a summarized message $o_{\mathcal{S}_i}$ to $\mathcal{M}$. The main trajectory integrates these intermediate results and finally outputs the answer $o_{\mathcal{M}}$.
  • Figure 3: Workflow of the decoupled two-agent architecture with M-GRPO. The main agent (left) and sub-agent (right) each generate rollouts via their own SGL router/server. The main agent logs trajectories and rewards to a shared database. The sub-agent reads the rewards it needs from the database and computes its own rewards for training. A central Agent Controller (middle) coordinates multi-turn interactions, assigns subtasks to the sub-agent, and aggregates the returned results. Tool calls (reason/search/visit) are executed through a Tool Server. The sub-agent side maintains a cache for sample synchronization. Arrows indicate data and control flow.
  • Figure 4: Trajectory alignment for batch training with variable sub-agent invocations. For each query, we sample $K$ rollouts. Every rollout yields one main-trajectory $\tau_{\mathcal{M}}$ and a variable number of sub-agent trajectories $\{\tau_{\mathcal{S}_i}\}$. Because the number of sub-invocations $d_k$ differs across rollouts, we fix a target $d$ (e.g., $8$) and randomly duplicate or drop $\tau_{\mathcal{S}}$ samples so that each batch contains a consistent count of $\tau_{\mathcal{S}}$ (while keeping a fixed number of $\tau_{\mathcal{M}}$, e.g., $8$). This alignment produces uniform tensor shapes for policy-gradient updates.
  • Figure 5: Reward curve during Stage 1 RL training on simple data. The system shows stable improvement from zero to high rewards, demonstrating effective format acquisition.
  • ...and 3 more figures
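
The trajectory-alignment step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `align_sub_trajectories` is hypothetical, alignment is assumed to be applied per rollout, and the behavior for a rollout with zero sub-agent invocations is not specified in the caption, so the sketch simply raises an error in that case.

```python
import random


def align_sub_trajectories(sub_trajs, d, rng=None):
    """Return exactly d sub-agent trajectories (hypothetical helper).

    If the rollout produced more than d sub-trajectories, randomly drop
    the surplus; if it produced fewer, randomly duplicate existing ones
    until the count reaches d.
    """
    if rng is None:
        rng = random.Random()
    if not sub_trajs:
        # Assumption: the caption does not say how empty rollouts are handled.
        raise ValueError("rollout has no sub-agent trajectories to align")
    if len(sub_trajs) >= d:
        # Drop: sample d distinct trajectories without replacement.
        return rng.sample(list(sub_trajs), d)
    # Duplicate: keep every original and pad with random copies.
    padding = [rng.choice(sub_trajs) for _ in range(d - len(sub_trajs))]
    return list(sub_trajs) + padding


rng = random.Random(0)  # fixed seed so the sketch is reproducible
padded = align_sub_trajectories(["tau_S1", "tau_S2", "tau_S3"], d=8, rng=rng)
trimmed = align_sub_trajectories([f"tau_S{i}" for i in range(12)], d=8, rng=rng)
```

With $d = 8$, a rollout with three sub-agent invocations is padded to eight entries and one with twelve is trimmed to eight, so every rollout contributes the same number of $\tau_{\mathcal{S}}$ samples and the resulting tensors have uniform shape.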