Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

Shan Yang; Yang Liu

TL;DR

DG-PG is a framework that constructs noise-free per-agent guidance gradients from differentiable analytical models of the system, decoupling each agent's gradient from the actions of all others. It preserves the equilibria of the cooperative game and achieves agent-independent sample complexity.

Abstract

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\epsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/\epsilon)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
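The $\Theta(N)$ variance claim can be illustrated with a toy Monte Carlo sketch. This is not the paper's environment or the DG-PG estimator. It assumes $N$ independent Bernoulli(0.5) policies with logit parameters, an additive shared reward $R = \sum_j a_j$, and a mean baseline; the function name and its arguments are hypothetical. Under these assumptions, the score-function gradient estimate for agent 1 has variance $(N-1)/16$, which grows linearly with $N$. A signal that depends only on the agent's own action has variance independent of $N$.

```python
import numpy as np


def score_gradient_variance(n_agents, include_others=True,
                            n_samples=100_000, seed=0):
    """Empirical variance of agent 1's score-function gradient estimate.

    Toy setting (an assumption, not the paper's setup): every agent draws
    a ~ Bernoulli(p) with p = sigmoid(theta) = 0.5.
    """
    rng = np.random.default_rng(seed)
    p = 0.5

    # Agent 1's own action.
    a1 = rng.integers(0, 2, size=n_samples)

    if include_others:
        # Shared reward: the sum of all N actions. The other N - 1 actions
        # are summarized by a single binomial draw per sample.
        others = rng.binomial(n_agents - 1, p, size=n_samples)
        reward = a1 + others
        baseline = n_agents * p
    else:
        # Decoupled signal: depends on agent 1's own action only.
        reward = a1
        baseline = p

    # For a Bernoulli policy with logit theta, d/dtheta log pi(a) = a - p.
    score = a1 - p
    grad = (reward - baseline) * score
    return grad.var()


if __name__ == "__main__":
    for n in (5, 50, 200):
        shared = score_gradient_variance(n)
        local = score_gradient_variance(n, include_others=False)
        print(f"N={n:3d}  shared-reward var={shared:.3f}  "
              f"analytic={(n - 1) / 16:.3f}  own-action var={local:.3f}")
```

The own-action case is deliberately degenerate: it only shows that removing the other agents' actions from the learning signal removes the dependence on $N$, which is the property the abstract attributes to the guidance gradients.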

Paper Structure (70 sections, 5 theorems, 55 equations, 5 figures, 9 tables)

Introduction
Preliminaries
Cooperative Objective and Policy Gradient
The Variance Explosion Problem
Method: Descent-Guided Policy Gradient
Leveraging Analytical Priors: System State and Reference
Descent-Guided Formulation
Implementation
Theoretical Guarantees
Consistency (Nash Invariance)
Variance Reduction
Convergence Rate
Related Work
Experiments
Setup
...and 55 more sections

Key Result

Theorem 4.1

Let $\boldsymbol{\theta}^*$ be a stationary point of the original cooperative objective, satisfying $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^*) = \mathbf{0}$. Under the paper's assumptions (exogeneity through alignment), $\boldsymbol{\theta}^*$ remains a stationary point of the augmented objective.

Figures (5)

Figure 1: Environment characteristics. (a) Bimodal workload distribution: CPU-intensive jobs (60%, Pareto $\alpha{=}1.7$) and memory-intensive jobs (40%, Pareto $\alpha{=}2.2$) with heavy tails. (b) Server heterogeneity across three hardware generations with CPU efficiency spanning $0.70$--$1.08$. (c) Non-stationary job arrival rate with tidal fluctuation (period = 1000 steps) and Gaussian noise.
Figure 2: Hyperparameter sensitivity. Training reward at $N{=}50$ for varying guidance weight $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$. All values converge rapidly and achieve similar final performance, confirming robustness to $\alpha$.
Figure 3: Controlled comparison. Training reward for $N \in \{2, 5, 10\}$. DG-PG (200 episodes, red) converges rapidly to the Best-Fit reference (dashed). MAPPO and IPPO (500 episodes) remain far below despite $2.5\times$ more training.
Figure 4: Scalability. Best checkpoint test reward vs. number of agents $N$. DG-PG closely tracks the Best-Fit heuristic across all scales, while Random degrades with $N$.
Figure 5: Scale-invariant convergence. DG-PG training reward across $N \in \{5, 10, 20, 50, 100, 200\}$. All scales converge within $\sim$10 episodes, empirically confirming $\mathcal{O}(1)$ sample complexity.

Theorems & Definitions (15)

Theorem 4.1: Nash Invariance
proof : Proof Sketch
Theorem 4.2: Agent-Independent Variance Bound
proof : Proof Sketch
Theorem 4.3: Sample Complexity
proof : Proof Sketch
Proposition 3.1: Alignment with Load Imbalance Reduction
proof
Lemma 5.1
proof
...and 5 more
