Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

Yan Wang, Ke Deng, Yongli Ren

TL;DR

Centralized-decentralized mismatch (CDM) hampers cooperative MARL under centralized training with decentralized execution (CTDE): the suboptimal behavior of one agent degrades the learning of the others. The authors introduce MCEM-NCD, which extends the Cross-Entropy Method to multiple agents and employs a monotonic nonlinear critic decomposition to factor $Q_{tot}$ into per-agent $Q^a$ while preserving decentralized execution. Off-policy learning is enhanced by a modified $k$-step $\lambda$-return in Sarsa form with Retrace corrections, which improves sample efficiency under the nonlinear decomposition. Empirically, MCEM-NCD outperforms state-of-the-art baselines on both discrete SMAC benchmarks and continuous Predator-Prey tasks, and ablations confirm the contribution of the nonlinear decomposition, the off-policy updates, and the percentile-greedy update scheme. Overall, MCEM-NCD offers a scalable, expressive MARL framework that mitigates CDM in complex cooperative environments.
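To make the monotonicity constraint concrete, the following is a minimal sketch of a QMIX-style mixing function, not the paper's NCD architecture. It assumes a two-layer mixer whose weights are made non-negative with an absolute value, so that $Q_{tot}$ never decreases when any per-agent $Q^a$ increases. The function name `monotonic_mix`, the fixed random weights, and the layer sizes are all illustrative assumptions; a trained mixer would typically condition its weights on the global state.

```python
import numpy as np

def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Mix per-agent values into one joint value (illustrative sketch).

    Taking np.abs of the weights keeps every partial derivative
    dQ_tot/dQ^a non-negative; the ReLU makes the mapping nonlinear
    while remaining non-decreasing.
    """
    hidden = np.maximum(agent_qs @ np.abs(w1) + b1, 0.0)
    return float(hidden @ np.abs(w2) + b2)

# Hypothetical sizes and weights, fixed here only for the demonstration.
rng = np.random.default_rng(0)
n_agents, hidden_dim = 3, 8
w1 = rng.normal(size=(n_agents, hidden_dim))
b1 = rng.normal(size=hidden_dim)
w2 = rng.normal(size=hidden_dim)
b2 = float(rng.normal())

q = np.array([0.2, -1.0, 0.5])   # per-agent values Q^a
q_up = q.copy()
q_up[1] += 1.0                   # raise one agent's value

q_tot = monotonic_mix(q, w1, b1, w2, b2)
q_tot_up = monotonic_mix(q_up, w1, b1, w2, b2)
print(q_tot, q_tot_up)           # the second value is never smaller
```

Because the joint value is non-decreasing in each $Q^a$, each agent can act greedily on its own value at execution time without contradicting the joint critic, which is the property that keeps execution decentralized.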

Abstract

Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others' learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks.
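The idea of raising the probability of high-value joint actions can be illustrated with a small tabular sketch of a cross-entropy-style update. This is an assumption-laden toy, not the authors' algorithm: the policies are per-agent categorical tables rather than networks, `joint_value` is a hypothetical stand-in for the learned critic, and the names `cem_update`, `rho` (elite fraction), and `lr` (smoothing rate) are chosen for illustration only.

```python
import numpy as np

def cem_update(probs, joint_value, rho=0.2, n_samples=256, lr=0.5, rng=None):
    """One CEM-style step for per-agent categorical policies (sketch).

    probs has shape (n_agents, n_actions). Joint actions are sampled with
    each agent acting independently, scored by joint_value, and the top
    rho fraction (the elite set) pulls each agent's policy toward the
    actions it took in those elite samples.
    """
    rng = rng or np.random.default_rng()
    n_agents, n_actions = probs.shape
    actions = np.stack(
        [rng.choice(n_actions, size=n_samples, p=probs[a]) for a in range(n_agents)],
        axis=1,
    )
    values = np.array([joint_value(a) for a in actions])
    threshold = np.quantile(values, 1.0 - rho)
    elite = actions[values >= threshold]  # never empty: the maximum always qualifies
    new_probs = probs.copy()
    for a in range(n_agents):
        freq = np.bincount(elite[:, a], minlength=n_actions) / len(elite)
        new_probs[a] = (1.0 - lr) * probs[a] + lr * freq
    return new_probs

# Hypothetical cooperative task: the team scores best at the joint action (2, 0).
target = np.array([2, 0])

def joint_value(joint_action):
    return -np.sum((joint_action - target) ** 2)

rng = np.random.default_rng(0)
probs = np.full((2, 3), 1.0 / 3.0)  # two agents, three actions, uniform start
for _ in range(30):
    probs = cem_update(probs, joint_value, rng=rng)
print(probs.round(3))
```

Each agent updates only its own table from the elite joint samples, so low-value joint actions stop influencing the update once they fall below the percentile threshold.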

Paper Structure

This paper contains 19 sections, 1 theorem, 21 equations, 5 figures.

Key Result

Theorem 5.1

The percentile-greedy policy $\boldsymbol{\pi}_{\rho}$, where $\rho>0$, is guaranteed to be at least as good as the centralized gradient policy $\boldsymbol{\pi}_g$ for any given $\boldsymbol{\tau}$. The guarantee is formulated in eq:distheorem for discrete actions and in eq:contheorem for continuous actions.

Figures (5)

  • Figure 1: The process flow of decentralized policy learning with multi-agent CEM (MCEM) is marked in blue (detailed in \ref{sec:policy}). The process flow of the off-policy method for learning a centralized but factored critic is marked in red (detailed in \ref{sec:cfc}).
  • Figure 2: Discrete action tasks - performance measured by median win rate on 9 scenarios in the SMAC benchmark.
  • Figure 3: Continuous action tasks - performance measured by mean episode return on 3 scenarios in continuous Predator-Prey with varying numbers of agents and prey.
  • Figure 4: Ablation study on the discrete tasks (top 3) and on the continuous tasks (bottom 3).
  • Figure 5: Impact of $\rho$.

Theorems & Definitions (2)

  • Theorem 5.1
  • proof