AgentMixer: Multi-Agent Correlated Policy Factorization

Zhiyuan Li; Wenshuai Zhao; Lijun Wu; Joni Pajarinen

TL;DR

AgentMixer tackles coordination in cooperative MARL by enabling correlated policies under partial observability. It introduces Policy Modifier to construct a correlated joint policy and Individual-Global-Consistency to align modes between joint and individual policies, enabling decentralized execution. The authors prove convergence to an $\epsilon$-approximate Correlated Equilibrium and validate the approach on MA-MuJoCo, SMAC-v2, Matrix Game, and Predator-Prey, where it matches or surpasses state-of-the-art methods. This work advances coordination in CTDE settings and offers a practical framework for scalable multi-agent systems.

Abstract

In multi-agent reinforcement learning, centralized training with decentralized execution (CTDE) methods typically assume that agents make decisions based on their local observations independently, which may not lead to a correlated joint policy with coordination. Coordination can be explicitly encouraged during training and individual policies can be trained to imitate the correlated joint policy. However, this may lead to an asymmetric learning failure due to the observation mismatch between the joint and individual policies. Inspired by the concept of correlated equilibrium, we introduce a strategy modification called AgentMixer that allows agents to correlate their policies. AgentMixer combines individual partially observable policies into a joint fully observable policy non-linearly. To enable decentralized execution, we introduce Individual-Global-Consistency to guarantee mode consistency during joint training of the centralized and decentralized policies and prove that AgentMixer converges to an $\epsilon$-approximate Correlated Equilibrium. In the Multi-Agent MuJoCo, SMAC-v2, Matrix Game, and Predator-Prey benchmarks, AgentMixer outperforms or matches state-of-the-art methods.
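To make the equilibrium notion in the abstract concrete, the following is a minimal sketch of how one might test whether a joint action distribution is an $\epsilon$-approximate correlated equilibrium in a one-shot matrix game. It uses the standard textbook definition (no agent gains more than $\epsilon$ in expectation from any strategy modification of its recommended action), which may differ in form from the paper's Dec-POMDP definition. The function name `ce_gap`, the dictionary encoding, and the game of chicken payoffs are illustrative assumptions, not taken from the paper.

```python
def ce_gap(joint, payoffs, n_actions):
    """Largest expected gain any single agent can obtain from a strategy
    modification f_i: A_i -> A_i (hypothetical helper, one-shot games only).

    joint:     dict mapping joint-action tuples to probabilities
    payoffs:   dict mapping joint-action tuples to per-agent payoff tuples
    n_actions: list with the number of actions of each agent
    """
    worst = 0.0
    for i, n_i in enumerate(n_actions):
        gain_i = 0.0
        # The best modification can be chosen separately per recommended action.
        for rec in range(n_i):
            best = 0.0  # following the recommendation yields zero gain
            for dev in range(n_i):
                gain = 0.0
                for a, p in joint.items():
                    if a[i] != rec:
                        continue
                    deviated = a[:i] + (dev,) + a[i + 1:]
                    gain += p * (payoffs[deviated][i] - payoffs[a][i])
                best = max(best, gain)
            gain_i += best
        worst = max(worst, gain_i)
    return worst


# Game of chicken (assumed payoffs): action 0 = dare, action 1 = yield.
chicken = {(0, 0): (0, 0), (0, 1): (7, 2), (1, 0): (2, 7), (1, 1): (6, 6)}

# A correlated recommendation that never tells both agents to dare.
correlated = {(0, 1): 1 / 3, (1, 0): 1 / 3, (1, 1): 1 / 3}
# Independent uniform play, i.e. a product policy without correlation.
uniform = {a: 0.25 for a in chicken}

gap_correlated = ce_gap(correlated, chicken, [2, 2])
gap_uniform = ce_gap(uniform, chicken, [2, 2])
```

A distribution is an $\epsilon$-approximate correlated equilibrium under this definition when `ce_gap` returns a value no larger than $\epsilon$. In this example the correlated recommendation has zero gap, while independent uniform play leaves each agent a positive incentive to deviate.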

Paper Structure (38 sections, 6 theorems, 27 equations, 15 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Background
Decentralized Partially Observable Markov Decision Processes
Equilibrium Notions
AgentMixer
Policy Modifier
Individual Global Consistency
Implementation of IGC
Convergence of AgentMixer
Experiments
Continuous Action Spaces: MA-MuJoCo
Discrete Action Spaces: SMAC-v2
Ablation Results
Conclusion and Future Work
...and 23 more sections

Key Result

Theorem 1

Given an optimal correlated joint fully observable policy $\pi_{\theta^*}$ that satisfies identifiability, the asymmetric distillation iteration converges to $\pi_{\eta^*}(a|b)$, which defines an optimal product partially observable policy, as $k \to \infty$.

Figures (15)

Figure 1: The partially observable bridge crossing task. Two agents (blue and orange boxes), with changing physiques (box sizes) in different episodes as shown in the left and right figures, must navigate to their destinations, marked with stars with corresponding colors, through passageways $1$ or $2$ while avoiding congestion. The expert is conditioned on an omniscient state indicating the physiques of all agents, while an agent cannot see the physique of another agent. Naively learning from the full observation expert policy, the agents would never stagger passageways, and instead cross the same passageway directly, incurring congestion.
Figure 2: AgentMixer contains two components: 1) Policy Modifier takes the individual partially observable policies and the state as inputs and produces a correlated joint fully observable policy as output, and 2) Individual-Global-Consistency keeps the modes of the joint policy and the individual policies consistent.
Figure 3: AgentMixer outperforms comparison methods on multiple Multi-Agent MuJoCo tasks. See the statistical significance tests in the Appendix for further evidence. Partial observability in MA-MuJoCo is a critical challenge for most baselines.
Figure 4: Ablations on Ant-v2. AIL shows a large performance gap between training and testing, which is caused by asymmetric learning failure. Other baselines fail to learn any effective policies, while AgentMixer obtains superior performance.
Figure 5: Policy Modifier consists of a policy embedding layer, an agent-mixing MLP, and a channel-mixing MLP.
...and 10 more figures
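
The Policy Modifier described in the captions of Figures 2 and 5 can be illustrated with a rough NumPy sketch. It assumes discrete actions, per-agent logits as the policy representation, residual connections, and tanh activations; none of these details are confirmed by the text above, and all names (`policy_modifier`, `init_mlp`, the parameter keys, the layer sizes) are hypothetical. Normalization layers and training are omitted.

```python
import numpy as np


def init_mlp(rng, d_in, d_hidden, d_out):
    """Random two-layer MLP parameters (illustrative initialization)."""
    return (
        rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_in),
        np.zeros(d_hidden),
        rng.standard_normal((d_hidden, d_out)) / np.sqrt(d_hidden),
        np.zeros(d_out),
    )


def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP; tanh is an assumption, not the paper's stated choice."""
    return np.tanh(x @ w1 + b1) @ w2 + b2


def policy_modifier(logits, state, params):
    """Map individual policy logits plus the state to modified joint logits.

    logits: (n_agents, n_actions) individual policy logits
    state:  (state_dim,) global state, available only during training
    """
    n_agents = logits.shape[0]
    # Policy embedding: each agent's logits, concatenated with the state.
    x = np.concatenate([logits, np.tile(state, (n_agents, 1))], axis=1)
    x = x @ params["embed_w"] + params["embed_b"]        # (n_agents, d)
    # Agent mixing: MLP applied along the agent axis, shared across channels.
    x = x + mlp(x.T, *params["agent_mix"]).T
    # Channel mixing: MLP applied along the feature axis, shared across agents.
    x = x + mlp(x, *params["channel_mix"])
    # Project back to one row of logits per agent.
    return x @ params["out_w"] + params["out_b"]


rng = np.random.default_rng(0)
n_agents, n_actions, state_dim, d = 3, 4, 5, 8
params = {
    "embed_w": rng.standard_normal((n_actions + state_dim, d)) / np.sqrt(n_actions + state_dim),
    "embed_b": np.zeros(d),
    "agent_mix": init_mlp(rng, n_agents, 16, n_agents),
    "channel_mix": init_mlp(rng, d, 16, d),
    "out_w": rng.standard_normal((d, n_actions)) / np.sqrt(d),
    "out_b": np.zeros(n_actions),
}
logits = rng.standard_normal((n_agents, n_actions))
state = rng.standard_normal(state_dim)
out = policy_modifier(logits, state, params)  # shape (n_agents, n_actions)
```

The agent-mixing step is what lets one agent's output depend on the other agents' policies, which is the sense in which the resulting joint policy is correlated rather than a product of independent policies.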

Theorems & Definitions (18)

Definition 1: $\epsilon$-approximate Nash Equilibrium
Definition 2: $\epsilon$-approximate Coarse Correlated Equilibrium
Definition 3: $\epsilon$-approximate Correlated Equilibrium
Definition 4: Implicit product policy
Definition 5: Identifiable policy pair
Theorem 1: Convergence of asymmetric distillation
proof
Definition 6: IGC
Theorem 2: Convergence of AgentMixer
proof
...and 8 more
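
As an illustration of the mode consistency that Individual-Global-Consistency (Definition 6) is meant to keep, the sketch below checks, for discrete actions, whether the most likely joint action equals the tuple of each agent's most likely individual action. This reading of "mode consistency" and the helper names `mode` and `satisfies_igc` are assumptions made for illustration; the paper's formal definition is not reproduced here.

```python
def mode(dist):
    """Most likely outcome of a discrete distribution given as a dict.

    Ties are broken by insertion order, which is an arbitrary choice here.
    """
    return max(dist, key=dist.get)


def satisfies_igc(joint, individuals):
    """True if the joint mode equals the tuple of individual modes."""
    return mode(joint) == tuple(mode(p) for p in individuals)


# Joint policy over two agents with two actions each (made-up numbers).
joint = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.1}

consistent = [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}]    # individual modes: (0, 1)
inconsistent = [{0: 0.4, 1: 0.6}, {0: 0.2, 1: 0.8}]  # individual modes: (1, 1)

ok = satisfies_igc(joint, consistent)
bad = satisfies_igc(joint, inconsistent)
```

When the check holds, agents that act greedily on their own decentralized policies reproduce the greedy action of the joint policy.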
