Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

Shan Yang, Yang Liu

TL;DR

Descent-Guided Policy Gradient (DG-PG) is a framework that constructs noise-free per-agent guidance gradients from differentiable analytical models of the domain, decoupling each agent's gradient from the actions of all others. It preserves the equilibria of the cooperative game and achieves agent-independent sample complexity.

Abstract

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $Θ(N)$, yielding sample complexity $\mathcal{O}(N/ε)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $Θ(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/ε)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
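The $Θ(N)$ variance scaling mentioned in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's environment or the DG-PG algorithm: it assumes $N$ independent Bernoulli policies with action probability 0.5, a shared reward equal to the sum of all actions, and a REINFORCE estimator with a constant baseline. All function names are hypothetical. Under these assumptions the variance of one agent's gradient estimate works out to $(N-1)/16$, so it grows linearly with the number of agents.

```python
import random
import statistics


def shared_reward_gradient_samples(n_agents, n_samples, seed=0):
    """Sample agent 0's REINFORCE gradient estimate under a shared reward.

    Toy assumptions (not from the paper): each agent takes action 1 with
    probability p = 0.5, the shared reward is the sum of all actions, and
    the baseline is the expected reward n_agents * p.
    """
    rng = random.Random(seed)
    p = 0.5
    baseline = n_agents * p
    samples = []
    for _ in range(n_samples):
        actions = [1 if rng.random() < p else 0 for _ in range(n_agents)]
        reward = sum(actions)  # every agent's action enters the signal
        score = actions[0] - p  # d/d(logit) of log pi(a_0) for a Bernoulli policy
        samples.append((reward - baseline) * score)
    return samples


def empirical_variance(n_agents, n_samples=20000, seed=0):
    """Population variance of agent 0's sampled gradient estimates."""
    samples = shared_reward_gradient_samples(n_agents, n_samples, seed)
    return statistics.pvariance(samples)


if __name__ == "__main__":
    for n in (5, 20, 80):
        # Compare the empirical variance with the closed form (n - 1) / 16.
        print(n, round(empirical_variance(n), 3), (n - 1) / 16)
```

In this toy setting the noise comes entirely from the other agents' actions entering agent 0's reward. A learning signal that does not depend on those actions would remove that term, which is the kind of decoupling the abstract attributes to the guidance gradients.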

Paper Structure (70 sections, 5 theorems, 55 equations, 5 figures, 9 tables)

Key Result

Theorem 4.1

Let $\boldsymbol{\theta}^*$ be a stationary point of the original cooperative objective, satisfying $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^*) = \mathbf{0}$. Under the paper's assumptions (exogeneity through alignment), $\boldsymbol{\theta}^*$ remains a stationary point of the augmented objective.

Figures (5)

  • Figure 1: Environment characteristics. (a) Bimodal workload distribution: CPU-intensive jobs (60%, Pareto $\alpha{=}1.7$) and memory-intensive jobs (40%, Pareto $\alpha{=}2.2$) with heavy tails. (b) Server heterogeneity across three hardware generations with CPU efficiency spanning $0.70$--$1.08$. (c) Non-stationary job arrival rate with tidal fluctuation (period = 1000 steps) and Gaussian noise.
  • Figure 2: Hyperparameter sensitivity. Training reward at $N{=}50$ for varying guidance weight $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$. All values converge rapidly and achieve similar final performance, confirming robustness to $\alpha$.
  • Figure 3: Controlled comparison. Training reward for $N \in \{2, 5, 10\}$. DG-PG (200 episodes, red) converges rapidly to the Best-Fit reference (dashed). MAPPO and IPPO (500 episodes) remain far below despite $2.5\times$ more training.
  • Figure 4: Scalability. Best checkpoint test reward vs. number of agents $N$. DG-PG closely tracks the Best-Fit heuristic across all scales, while Random degrades with $N$.
  • Figure 5: Scale-invariant convergence. DG-PG training reward across $N \in \{5, 10, 20, 50, 100, 200\}$. All scales converge within $\sim$10 episodes, consistent with sample complexity that is $\mathcal{O}(1)$ in the number of agents $N$.

Theorems & Definitions (15)

  • Theorem 4.1: Nash Invariance
  • proof : Proof Sketch
  • Theorem 4.2: Agent-Independent Variance Bound
  • proof : Proof Sketch
  • Theorem 4.3: Sample Complexity
  • proof : Proof Sketch
  • Proposition 3.1: Alignment with Load Imbalance Reduction
  • proof
  • Lemma 5.1
  • proof
  • ...and 5 more