CORD: Generalizable Cooperation via Role Diversity

Kanefumi Matsuyama, Kefan Su, Jiangxing Wang, Deheng Ye, Zongqing Lu

TL;DR

CORD tackles generalization gaps in cooperative MARL by learning a hierarchical policy in which a high-level controller assigns roles to low-level agents. The method maximizes role entropy under a causal-graph constraint, decomposing the objective into causal influence in role and role heterogeneity, which are turned into intrinsic rewards $r_c$ and $r_d$ and trained end-to-end with environmental rewards. Empirical results on MPE resource collection and SMAC show strong generalization to unseen teams and agents, with ablations supporting the efficacy of the constrained entropy objective. Although it requires periodic global information for role assignment, CORD offers a principled path to robust, adaptable coordination without predefined teammate policies.
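As a rough illustration of how the two intrinsic rewards might enter training alongside the environmental reward, the sketch below uses a plain weighted sum. The weighted-sum form, the `RewardWeights` class, the `shaped_reward` function, and the coefficient values are all assumptions made for illustration; the summary above does not say how CORD weights or routes these terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical coefficients; the source does not specify any weighting.
    causal: float = 0.1     # weight on r_c (causal influence in role)
    diversity: float = 0.1  # weight on r_d (role heterogeneity)


def shaped_reward(r_env: float, r_c: float, r_d: float,
                  w: RewardWeights = RewardWeights()) -> float:
    """Combine the environmental reward with the two intrinsic rewards.

    Assumes a simple weighted sum, which is an illustrative choice only.
    """
    return r_env + w.causal * r_c + w.diversity * r_d


if __name__ == "__main__":
    # One step with made-up reward values.
    print(shaped_reward(r_env=1.0, r_c=2.0, r_d=3.0))
```

Setting both weights to zero recovers training on the environmental reward alone, which is one way such a sketch could be used to mimic an ablation.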

Abstract

Cooperative multi-agent reinforcement learning (MARL) aims to develop agents that can collaborate effectively. However, most cooperative MARL methods overfit to the agents seen during training, so the learned policies generalize poorly to unseen collaborators, which is a critical issue for real-world deployment. Some methods attempt to address the generalization problem but require prior knowledge or predefined policies of new teammates, limiting their real-world applicability. To this end, we propose CORD, a hierarchical MARL approach that enables generalizable cooperation via role diversity. CORD's high-level controller assigns roles to low-level agents by maximizing the role entropy with constraints. We show that this constrained objective can be decomposed into causal influence in role, which enables reasonable role assignment, and role heterogeneity, which yields coherent, non-redundant role clusters. Evaluated on a variety of cooperative multi-agent tasks, CORD achieves better performance than baselines, especially in generalization tests. Ablation studies further demonstrate the efficacy of the constrained objective in generalizable cooperation.

Paper Structure (25 sections, 3 theorems, 17 equations, 5 figures, 5 tables, 1 algorithm)

Key Result

Theorem 4.4

Suppose that both the prior role distribution $\mathcal{P}(\mathbf{c}|\mathbf{q})$ and the posterior role distribution $\mathcal{P}(\mathbf{c}|\boldsymbol{\bar{I}}, \mathbf{q})$ are Gaussian and the $\mathbf{c}$-related matrix $A(\mathbf{c})$ satisfies Definition d2, then the entropy o… where $\mathcal{I}$ is mutual information, $\boldsymbol{\bar{I}}_i = \boldsymbol{\bar{I}}_t^i$, …
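The theorem statement is truncated here, so its exact decomposition is not reproduced. As background for the Gaussian assumption, the sketch below computes two standard closed-form quantities: the entropy of a Gaussian and the KL divergence from a Gaussian posterior to a Gaussian prior. By the standard identity, the conditional mutual information $\mathcal{I}(\mathbf{c}; \boldsymbol{\bar{I}} \mid \mathbf{q})$ equals the expected KL divergence between $\mathcal{P}(\mathbf{c}|\boldsymbol{\bar{I}}, \mathbf{q})$ and $\mathcal{P}(\mathbf{c}|\mathbf{q})$. The diagonal-covariance restriction and the function names are assumptions for illustration, not the paper's formulation.

```python
import math
from typing import Sequence


def gaussian_entropy(var: Sequence[float]) -> float:
    """Differential entropy (in nats) of a diagonal Gaussian with variances `var`."""
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * v) for v in var)


def gaussian_kl(mu_q: Sequence[float], var_q: Sequence[float],
                mu_p: Sequence[float], var_p: Sequence[float]) -> float:
    """KL(q || p) in nats between two diagonal Gaussians.

    Here q plays the role of the posterior P(c | I, q) and p the prior P(c | q);
    the diagonal covariance is an illustrative assumption.
    """
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


if __name__ == "__main__":
    # Made-up 2-D role embedding statistics.
    prior_mu, prior_var = [0.0, 0.0], [1.0, 1.0]
    post_mu, post_var = [1.0, 0.0], [1.0, 1.0]
    print(gaussian_entropy(prior_var))
    print(gaussian_kl(post_mu, post_var, prior_mu, prior_var))
```

Averaging `gaussian_kl` over sampled conditioning inputs would give a Monte Carlo estimate of the mutual information under these assumptions.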

Figures (5)

  • Figure 1: Illustration of the causal graph. Black circles are inherent states in the environment. Blue circles represent related information of other agents except for agent $i$. Red circles are related information of agent $i$. Black dashed lines represent the direction of information transmission. Blue solid lines represent the causal effect on the role of agent $i$ and red solid lines are the individual information.
  • Figure 2: Overview of the CORD framework. The purple module is the high-level controller network. The blue modules represent the agents' individual Q-networks and the orange module is the mixing network. To also handle entity-based settings, the network input $(\mathbf{o}_t, \boldsymbol{a}_{t-1})/\mathbf{x}^\mathcal{E}$ represents different types of observation for different entities, and $\mathbf{M}^{team}_t$ and $\mathbf{M}_t^i$ are the masks of the team state and of the observation of agent $i$, respectively.
  • Figure 3: Episode rewards on training and generalization to unseen teams of CORD compared with all baselines in resource collection: (a) learning curves on training tasks, (b) generalization to a 5-agent team, and (c) generalization to a 6-agent team.
  • Figure 4: Win rates on training and generalization to unseen agents of CORD compared with all baselines in SMAC: (a) learning curves in 3-7sz, (b) learning curves in 3-7MMM, (c) generalization to 5sz, and (d) generalization to 5MMM, where (c) and (d) are box plots.
  • Figure 5: Win rates on training and generalization to unseen agents of CORD compared with all baselines on more maps in SMAC: (a) learning curves in 3-7m, (b) learning curves in 3-7csz, (c) generalization to 5m, and (d) generalization to 5csz, where (c) and (d) are box plots.

Theorems & Definitions (9)

  • Definition 4.2
  • Definition 4.3
  • Theorem 4.4
  • proof
  • proof
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof