Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs

Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao

TL;DR

Stronger-MAS introduces AT-GRPO, an agent- and turn-wise grouped reinforcement learning method tailored for on-policy training in multi-agent LLM systems. It is paired with a novel MAS training system that supports concurrent updates for multiple policies across diverse workflows. Across game, planning, coding, and math tasks with Qwen3-1.7B and 8B models, AT-GRPO delivers substantial gains, particularly in long-horizon planning, and reveals task-dependent benefits of role specialization. The work provides practical guidance on scaling cooperative LLMs and presents a general MARL framework for MAS that balances algorithmic innovation with system-level design.

Abstract

Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, for example through GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it raises accuracy from a single-agent RL baseline of 14.0 to 47.0 percent up to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
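
To make the grouping idea concrete, here is a minimal sketch of group-relative advantage computation in which samples are grouped by agent and turn rather than by task alone. It is an illustration under stated assumptions, not the authors' implementation: the function name `grouped_advantages`, the sample fields (`env_id`, `agent`, `turn`, `reward`), and the mean/standard-deviation normalization are all hypothetical choices modeled on common GRPO-style practice.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-6):
    """Return one advantage per sample, normalized within its group.

    Assumption: each sample is a dict with hypothetical keys
    'env_id', 'agent', 'turn', and 'reward'. Samples that share
    (env_id, agent, turn) are treated as answering the same prompt
    and therefore as comparable.
    """
    groups = defaultdict(list)
    for i, s in enumerate(samples):
        groups[(s["env_id"], s["agent"], s["turn"])].append(i)

    advantages = [0.0] * len(samples)
    for indices in groups.values():
        rewards = [samples[i]["reward"] for i in indices]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std; 0.0 for a group of size 1
        for i in indices:
            advantages[i] = (samples[i]["reward"] - mu) / (sigma + eps)
    return advantages


# Two roles at the same turn form two separate comparison groups.
demo = [
    {"env_id": 0, "agent": "coder", "turn": 0, "reward": 1.0},
    {"env_id": 0, "agent": "coder", "turn": 0, "reward": 0.0},
    {"env_id": 0, "agent": "tester", "turn": 0, "reward": 0.5},
    {"env_id": 0, "agent": "tester", "turn": 0, "reward": 0.5},
]
demo_advantages = grouped_advantages(demo)
```

Because the coder and tester samples are never mixed, a reward is only compared against rewards earned from the same role at the same turn. A group of size 1 always receives an advantage of zero, which is why group size matters for the learning signal.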

Paper Structure

This paper contains 77 sections, 2 theorems, 72 equations, 6 figures, 8 tables, 1 algorithm.

Key Result

Lemma 1

Under the outcome-aligned assumption, the set of actions maximizing the verifiable reward is a subset of the set of actions maximizing the optimal $Q$-function.
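
In symbols, the lemma can be restated as follows. The notation here is assumed for illustration and may differ from the paper's: $s$ is a state, $a$ an action from an action set $\mathcal{A}$, $r(s,a)$ the verifiable reward, and $Q^{*}(s,a)$ the optimal action-value function.

```latex
\arg\max_{a \in \mathcal{A}} r(s, a) \;\subseteq\; \arg\max_{a \in \mathcal{A}} Q^{*}(s, a)
```

Read this way, any action that is greedy with respect to the verifiable reward is also greedy with respect to $Q^{*}$, which is the property the verifier-greedy policy in Proposition 1 relies on.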

Figures (6)

  • Figure 1: MAS+AT-GRPO vs. Single-agent+GRPO. The gray line denotes the prompt-only MAS baseline.
  • Figure 2: MAS workflow across different domains. (a) Role-based coordination: code generation via a coder–tester loop. (b) Different task-specific workflows for Game/Plan, Code, and Math; see the experimental workflow section and the prompt appendix for workflow details.
  • Figure 3: Two sampling schemes. (a) In parallel sampling, trajectories are sampled but incomparable, leading to groups of size 1. (b) In tree sampling, branching at each turn forms a valid comparison group of size $K$.
  • Figure 4: MAS training system. Each LLM $m$ has a GPU-pinned model pool with a RolloutWorker and an UpdateWorker. A CPU environment pool hosts envworkers that execute environment steps. Trajectories are routed to the corresponding UpdateWorker.
  • Figure 5: (a) The system aggregates outputs from an ensemble of $N$ Reasoners and $M$ Tool-Users into a Judge. The total agent count scales as $M+N+1$, allowing for flexible resource allocation. (b) Evaluation on AIME24 (using Qwen3-8B).
  • ...and 1 more figure
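
The contrast in Figure 3 can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not the paper's sampler: a prompt is modeled as the pair (turn, action history), each sampled action is a distinct integer, and the tree sampler always continues from the first candidate. The helper names `parallel_prompts`, `tree_prompts`, and `group_sizes` are hypothetical.

```python
from collections import Counter


def parallel_prompts(k, num_turns):
    """K independent trajectories; each prompt depends on that trajectory's own history."""
    prompts = []
    for traj in range(k):
        history = ()
        for turn in range(num_turns):
            prompts.append((turn, history))
            history += (traj,)  # assumption: trajectory `traj` always takes a distinct action
    return prompts


def tree_prompts(k, num_turns):
    """Branch K candidates from one shared prompt at every turn, then continue from one."""
    prompts = []
    history = ()
    for turn in range(num_turns):
        candidates = list(range(k))
        prompts.extend([(turn, history)] * k)  # all K candidates share the same prompt
        history += (candidates[0],)  # assumption: continue from the first candidate
    return prompts


def group_sizes(prompts):
    """Sizes of the comparison groups formed by identical prompts."""
    return sorted(Counter(prompts).values())


parallel_groups = group_sizes(parallel_prompts(4, 3))
tree_groups = group_sizes(tree_prompts(4, 3))
```

In this toy model, parallel trajectories share a prompt only at the first turn, since their histories diverge afterwards and every later group has size 1. Tree sampling yields one group of size $K$ at every turn. How the continuing branch is chosen in the actual system is not specified here; picking the first candidate is purely a placeholder.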

Theorems & Definitions (4)

  • Lemma 1: Equivalence of Maximizers
  • proof
  • Proposition 1: Optimality of Verifier-Greedy Policy
  • proof