Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Lang Feng, Longtao Zheng, Shuo He, Fuxiang Zhang, Bo An

TL;DR

This work identifies gradient-norm instability when applying Group Relative Policy Optimization to multi-agent LLM systems due to a single global baseline misaligning with diverse agent reward distributions. It introduces Dr. MAS, an agent-wise normalization approach that calibrates gradient scales by each agent's reward statistics, supported by theory showing reduced gradient variance and by empirical gains on math reasoning and multi-turn search benchmarks. The authors also provide an end-to-end MAS RL framework for scalable orchestration, flexible per-agent LLM serving, and efficient resource scheduling, demonstrating robust performance improvements and stability across sharing and non-sharing configurations. The results show Dr. MAS not only improves task performance but also enables cost-effective heterogeneous model deployments while preserving stability in complex multi-agent coordination scenarios.

Abstract

Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.
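To make the contrast between a global baseline and the agent-wise remedy concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each step carries a scalar reward and the id of the agent that acted, uses population statistics for the mean and standard deviation, and adds a small constant `EPS` for numerical safety. The function names, `EPS`, and the reward values are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def global_advantages(rewards):
    """GRPO-style baseline: one (mu, sigma) shared by every agent's steps."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def agentwise_advantages(rewards, agent_ids):
    """Agent-wise baseline: each step uses its own agent's (mu_k, sigma_k)."""
    by_agent = defaultdict(list)
    for r, k in zip(rewards, agent_ids):
        by_agent[k].append(r)
    stats = {k: (fmean(rs), pstdev(rs)) for k, rs in by_agent.items()}
    return [
        (r - stats[k][0]) / (stats[k][1] + EPS)
        for r, k in zip(rewards, agent_ids)
    ]


# Hypothetical rewards: steps taken by one agent are associated with high
# rewards, steps taken by the other with low rewards.
rewards = [0.9, 1.0, 0.8, 0.1, 0.0, 0.2]
agent_ids = ["solver"] * 3 + ["verifier"] * 3

adv_global = global_advantages(rewards)
adv_agent = agentwise_advantages(rewards, agent_ids)
```

With these toy numbers, the global baseline gives every solver step a positive advantage and every verifier step a negative one, so each agent's advantages are systematically offset from zero. Under agent-wise normalization, each agent's advantages are centered on zero and scaled by that agent's own spread.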

Paper Structure

This paper contains 35 sections, 2 theorems, 17 equations, 8 figures, 3 tables, 1 algorithm.

Key Result

Lemma 4.2

Under Assumptions assump:score, for any agent $k$, where $\mu_k \triangleq \frac{1}{|\mathcal{Y}_k|}\sum_{\bm{a}_t^i\in \mathcal{Y}_k} R^i$, $\sigma_k^2 \triangleq \frac{1}{|\mathcal{Y}_k|}\sum_{\bm{a}_t^i\in \mathcal{Y}_k} (R^i - \mu_k)^2$ are the mean and variance when sampling time steps uniformly from $\mathcal{Y}_k$ (i.e., when agent $k$ is act

Figures (8)

  • Figure 1: Algorithm comparison. (a) GRPO with global baseline $(\mu,\sigma)$ can cause unstable gradient norm. (b) Dr. MAS with per-agent normalization $(\mu_k,\sigma_k)$ stabilizes the training of MAS.
  • Figure 2: Overview of multi-agent LLM RL framework. A multi-agent orchestrator manages distributed rollouts, agents are mapped to LLM worker groups with optional LLM sharing, and a shared resource pool schedules actor backends for efficient inference and per-model optimization.
  • Figure 3: Illustration of the orchestrations. Left: Math orchestration uses a two-agent loop, where a solver proposes candidate solutions and a verifier evaluates and either approves or requests refinement. Right: Multi-turn search orchestration uses a hierarchical three-agent pipeline, where a top-level verifier selectively invokes either a search agent to retrieve external information or an answer agent to produce the final result.
  • Figure 4: Comparison of training accuracy and gradient norm between GRPO and Dr. MAS. The results are recorded during multi-agent RL post-training for three-agent search orchestration under LLM non-sharing (Qwen2.5-3B).
  • Figure 5: Performance and efficiency comparison between homogeneous (all 7B models) and heterogeneous (7B for Verifier, 3B for Search/Answer) model assignment on search tasks. Token counts are the average tokens per trajectory for each agent. Cost ($) is estimated using OpenRouter market prices (7B: $0.30/M tokens, 3B: $0.06/M tokens) and reported as the total inference cost over the full test set (51.7k samples).
  • ...and 3 more figures
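The cost estimate described in the Figure 5 caption is simple arithmetic: average tokens per trajectory, times the number of test samples, times the per-token price of the model each agent runs on. The sketch below illustrates that calculation under stated assumptions. The prices and the sample count come from the caption (51.7k is taken as 51,700), while the per-agent token counts and the `total_cost` helper are hypothetical and do not reproduce the paper's measurements.

```python
PRICE_PER_M_TOKENS = {"7B": 0.30, "3B": 0.06}  # USD per million tokens (Figure 5 caption)
NUM_SAMPLES = 51_700  # full test set size, rounded as in the caption


def total_cost(tokens_per_traj, model_of_agent, num_samples=NUM_SAMPLES):
    """Total inference cost in USD, summed over agents."""
    cost = 0.0
    for agent, tokens in tokens_per_traj.items():
        price = PRICE_PER_M_TOKENS[model_of_agent[agent]]
        cost += tokens * num_samples * price / 1_000_000
    return cost


# Hypothetical average tokens per trajectory for each agent (not from the paper).
tokens = {"verifier": 1000, "search": 500, "answer": 500}

homogeneous = {agent: "7B" for agent in tokens}
heterogeneous = {"verifier": "7B", "search": "3B", "answer": "3B"}

cost_homogeneous = total_cost(tokens, homogeneous)
cost_heterogeneous = total_cost(tokens, heterogeneous)
```

Holding token counts fixed across the two assignments is a simplification. In practice, different models produce different numbers of tokens per trajectory, which is why the figure reports token counts alongside cost.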

Theorems & Definitions (4)

  • Lemma 4.2
  • Proposition 4.3: Gradient-Norm Inflation
  • proof
  • proof