Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi

TL;DR

Graph-GRPO is a topology optimization framework for LLM-based multi-agent systems that integrates Group Relative Policy Optimization. The authors report that it significantly outperforms state-of-the-art baselines, trains more stably, and identifies critical communication pathways previously obscured by reward noise.

Abstract

Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
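
The group normalization described in the abstract can be illustrated with a minimal sketch. It uses the standard GRPO form (subtract the group mean, divide by the group standard deviation); the function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, and the paper's exact formulation is not reproduced in this excerpt.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one query's group of sampled topologies.

    Standard GRPO-style normalization; `eps` is an assumed stabilizer
    that avoids division by zero when all rewards in the group are equal.
    """
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()      # group baseline
    sigma = rewards.std()    # group spread
    return (rewards - mu) / (sigma + eps)


# Easy query: every sampled topology succeeds, so no sample stands out.
easy = group_relative_advantages([1, 1, 1, 1])

# Harder query: only one topology succeeds, so it alone gets a positive advantage.
hard = group_relative_advantages([1, 0, 0, 0])
```

With uniform rewards the advantages are all zero, so such a batch contributes no gradient; with mixed rewards the advantages are centered on zero, which removes the per-query difficulty offset.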

Paper Structure (29 sections, 7 equations, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: Motivation Analysis: The Trap of Non-Informative Batches in Easy Queries. The figure illustrates a scenario where a task is simple enough that diverse sampled topologies (Samples 1--4, ranging from efficient chains to dense structures with redundant edges) all yield correct answers and identical rewards ($R_k=1$). (Top Right) Standard policy gradient methods like REINFORCE use raw rewards. Since $R_k \equiv 1$ across the entire group, the gradient estimation indiscriminately reinforces all sampled edges, including noise and redundancies (e.g., extra edges in S3 & S4), leading to suboptimal convergence. (Bottom Right) Our proposed Graph-GRPO addresses this by incorporating a group baseline $\mu$. In such uniform-reward scenarios, $\mu$ equals individual rewards, resulting in near-zero advantage ($A_{ij} \approx 0$). This mechanism effectively blocks parameter updates from non-informative batches, preventing the model from learning redundant structures from noise.
  • Figure 2: The overall framework of Graph-GRPO. (1) Policy Network & Construction: The module encodes agent roles and the task query using a GAT-based encoder to generate a probabilistic connectivity matrix $P_\theta$, constrained by a DAG mask to ensure acyclic flow. (2) Group Sampling (Exploration): Instead of drawing a single topology, we generate a group of $K$ diverse topologies via independent Bernoulli sampling. This exploration captures various structural patterns, where successful topologies receive a reward of 1 and failures (e.g., disconnected graphs) receive a reward of 0. (3) Edge-Level Graph-GRPO: The core optimization phase. We calculate a group baseline $\mu$ and estimate the specific advantage of each target edge $e_{ij}$. Edges associated with a success rate higher than the baseline ($A_{ij} > 0$) are reinforced, iteratively updating the policy parameters $\theta$.
  • Figure 3: Token efficiency analysis on MMLU and GSM8K benchmarks. The bubble size represents the relative token consumption. Graph-GRPO (Red) achieves the highest accuracy (positioned furthest to the right) while maintaining a low token cost comparable to EIB-LEARNER (Purple) and G-Designer (Pink). Our method effectively suppresses redundant edges without explicit pruning constraints, achieving a superior performance-efficiency trade-off compared to complete graphs (Blue) and debate-based baselines (Brown).
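
The sampling and edge-level steps described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the DAG mask is assumed to be strictly upper-triangular under a fixed agent ordering, and the edge advantage is assumed to be the mean reward of the samples containing that edge minus the group baseline $\mu$. The function names are hypothetical, and the paper's exact equations are not part of this excerpt.

```python
import numpy as np


def sample_dag_topologies(edge_probs, k, rng):
    """Draw k adjacency matrices by independent Bernoulli sampling per edge.

    Assumption: agents have a fixed order and edges may only point from a
    lower index to a higher one, so every sample is acyclic.
    """
    edge_probs = np.asarray(edge_probs, dtype=float)
    n = edge_probs.shape[0]
    dag_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    draws = rng.random((k, n, n)) < edge_probs
    return draws & dag_mask


def edge_advantages(graphs, rewards):
    """Assumed edge-level advantage: success rate with the edge minus mu."""
    graphs = np.asarray(graphs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()                      # group baseline
    counts = graphs.sum(axis=0)              # samples containing each edge
    reward_sum = np.tensordot(rewards, graphs, axes=(0, 0))
    # Edges that never appear in the group default to mu, i.e. zero advantage.
    mean_with_edge = np.divide(
        reward_sum, counts, out=np.full_like(counts, mu), where=counts > 0
    )
    return mean_with_edge - mu


rng = np.random.default_rng(0)
probs = np.full((4, 4), 0.5)                 # hypothetical P_theta for 4 agents
group = sample_dag_topologies(probs, k=8, rng=rng)
uniform = edge_advantages(group, np.ones(8))  # easy query: all samples succeed
```

Under this assumed form, a group in which every topology succeeds yields zero advantage for every edge, matching the behaviour described in the Figure 1 caption, while an edge that appears mostly in successful samples receives a positive advantage.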