ME-IGM: Individual-Global-Max in Maximum Entropy Multi-Agent Reinforcement Learning

Wen-Tse Chen, Yuxuan Li, Shiyu Huang, Jiayu Chen, Jeff Schneider

TL;DR

ME-IGM addresses a misalignment that arises when maximum entropy is applied to IGM-based cooperative MARL. It introduces an order-preserving transformation (OPT) that maps local Q-values to policy logits without breaking the IGM structure. This enables entropy-driven exploration while keeping locally chosen actions aligned with the joint action that maximizes the global Q-value. The OPT is trained with a KL-based or squared-error objective under centralized training with decentralized execution (CTDE), using a hyper-network-based mixer. Results on matrix games, Overcooked, and SMAC-v2 show state-of-the-art performance and stronger coordination, and ablations indicate that the OPT is needed for monotonic policy improvement and effective credit assignment. The method targets discrete-action domains; future work aims to extend ME-IGM to continuous action spaces and broader environments.

Abstract

Multi-agent credit assignment is a fundamental challenge for cooperative multi-agent reinforcement learning (MARL), where a team of agents learns from shared reward signals. The Individual-Global-Max (IGM) condition is a widely used principle for multi-agent credit assignment, requiring that the joint action determined by individual Q-functions maximizes the global Q-value. Meanwhile, the principle of maximum entropy has been leveraged to enhance exploration in MARL. However, we identify a critical limitation in existing maximum entropy MARL methods: a misalignment arises between local policies and the joint policy that maximizes the global Q-value, leading to violations of the IGM condition. To address this misalignment, we propose an order-preserving transformation. Building on it, we introduce ME-IGM, a novel maximum entropy MARL algorithm compatible with any credit assignment mechanism that satisfies the IGM condition while enjoying the benefits of maximum entropy exploration. We empirically evaluate two variants of ME-IGM, ME-QMIX and ME-QPLEX, in non-monotonic matrix games and demonstrate their state-of-the-art performance across 17 scenarios in SMAC-v2 and Overcooked.
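
The IGM condition described above can be checked directly on a small matrix game. The following is a minimal sketch, not code from the paper: the helper name `satisfies_igm`, the payoff table, and the local Q-values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def satisfies_igm(q_tot, local_qs):
    """Return True if the per-agent greedy actions jointly maximize q_tot."""
    greedy = tuple(int(np.argmax(q)) for q in local_qs)
    return bool(np.isclose(q_tot[greedy], q_tot.max()))

# Illustrative two-agent, three-action joint Q-table (hypothetical values).
q_tot = np.array([[8.0, -12.0, -12.0],
                  [-12.0, 0.0, 0.0],
                  [-12.0, 0.0, 0.0]])

# Local Q-values whose greedy actions are (0, 0), the maximizer of q_tot.
aligned = [np.array([1.0, 0.2, 0.1]), np.array([0.9, 0.3, 0.0])]
# Local Q-values whose greedy actions are (1, 1), a suboptimal joint action.
misaligned = [np.array([0.1, 1.0, 0.5]), np.array([0.0, 0.8, 0.2])]

print(satisfies_igm(q_tot, aligned))
print(satisfies_igm(q_tot, misaligned))
```

The first call reports that the greedy joint action reaches the maximum of the joint table, while the second shows the kind of violation the abstract refers to: each agent acts greedily on its own Q-values, yet the resulting joint action is not the global maximizer.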

Paper Structure

This paper contains 25 sections, 4 theorems, 19 equations, 7 figures, 15 tables, 1 algorithm.

Key Result

Theorem 1

Denote the policy before the policy improvement as $\pi_{\text{old}}$ and the policy achieving the optimality of the improvement step as $\pi_{\text{new}}$. Updating the policy with Equation (eq:kl_divergence) ensures monotonic improvement in the global Q-value. That is, $Q^{\pi_{\text{old}}}_{tot}(\ldots)$ … In contrast, updating the policy with Equation (eq:real_policy_improvement) only ensures that $Q^{\ldots}$ …

Figures (7)

  • Figure 1: Comparison of our approach with existing maximum entropy MARL methods. The left panel shows a straightforward way to apply maximum entropy MARL in the CTDE context, where blue text marks the desired objectives and black text marks the corresponding constraints. It reveals that existing maximum entropy MARL methods under the CTDE framework implicitly constrain the global Q-value to be the sum of local Q-values (as in VDN), significantly limiting the expressiveness of the critic network. The right panel depicts the improvements in ME-IGM. ME-IGM first applies any credit assignment mechanism that satisfies the IGM condition, such as QMIX or QPLEX, to obtain $Q_i$ for each agent $i$. Because the order of local Q-values across actions is meaningful, an order-preserving transformation $f_i$ converts $Q_i$ to $f_i(Q_i)$, which serves as the policy logits. This transformation is optimized using a loss function that minimizes the expected difference between $\sum_i f_i(Q_i)$ and $Q_{tot}$, which guarantees monotonic policy improvement in maximum entropy MARL.
  • Figure 2: Illustration of misalignment between local policies and the maximum global Q-value. When naively combining the IGM condition with maximum entropy MARL, local policies often select suboptimal joint actions, leading to a lower Q-value, depicted by the blue curve. In contrast, the optimal joint action should achieve the global-max Q-value, shown as the orange curve.
  • Figure 3: The overall pipeline of ME-IGM.
  • Figure 4: The mean return and standard deviation of QMIX, QPLEX, ME-QMIX, and ME-QPLEX in Overcooked. ME-QMIX and ME-QPLEX achieve higher returns and converge faster than QMIX and QPLEX, which do not adopt maximum entropy.
  • Figure 5: Ablation study on exploration strategies in QMIX. Comparison between ME-QMIX and a modified QMIX with extended epsilon annealing on the Protoss 5v5 map. The results show that simply increasing epsilon-greedy exploration is insufficient for improving performance. In contrast, ME-QMIX enables QMIX agents to perform more structured exploration through the maximum entropy framework, leading to higher returns.
  • ...and 2 more figures
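
The Figure 1 caption describes turning local Q-values into policy logits through an order-preserving transformation and fitting it so that the summed logits track $Q_{tot}$. The sketch below illustrates that idea under strong simplifying assumptions: the affine form of `opt` (a softplus-positive slope plus a bias), the uniform weighting over joint actions in `opt_squared_error`, and all numeric values are hypothetical choices made for this example, not the paper's architecture or objective.

```python
import numpy as np

def softmax(x, alpha=1.0):
    """Softmax with temperature alpha, shifted for numerical stability."""
    z = (x - np.max(x)) / alpha
    e = np.exp(z)
    return e / e.sum()

def opt(q, w_raw, b):
    """Hypothetical order-preserving map: strictly positive slope plus a bias."""
    slope = np.log1p(np.exp(w_raw))  # softplus, so slope > 0 for any w_raw
    return slope * q + b

def opt_squared_error(q_tot, logits_1, logits_2):
    """Mean squared gap between f_1(Q_1) + f_2(Q_2) and Q_tot over joint actions
    (uniform weighting is an assumption of this sketch)."""
    joint_logits = logits_1[:, None] + logits_2[None, :]
    return float(np.mean((joint_logits - q_tot) ** 2))

# Hypothetical local Q-values and joint Q-table for two agents, three actions each.
q_1 = np.array([1.0, 0.2, 0.1])
q_2 = np.array([0.9, 0.3, 0.0])
q_tot = np.array([[8.0, -12.0, -12.0],
                  [-12.0, 0.0, 0.0],
                  [-12.0, 0.0, 0.0]])

logits_1 = opt(q_1, w_raw=2.0, b=-1.0)
logits_2 = opt(q_2, w_raw=0.5, b=0.3)

# The ranking of actions is unchanged, so each agent's greedy action is unchanged.
order_kept = (np.array_equal(np.argsort(q_1), np.argsort(logits_1))
              and np.array_equal(np.argsort(q_2), np.argsort(logits_2)))

# Each agent samples from a softmax over its own transformed Q-values.
pi_1 = softmax(logits_1, alpha=0.5)
pi_2 = softmax(logits_2, alpha=0.5)

loss = opt_squared_error(q_tot, logits_1, logits_2)
print(order_kept, pi_1.round(3), pi_2.round(3), round(loss, 3))
```

Because the slope is always positive, the most probable action under each local policy is the same action that is greedy under the local Q-values, so stochastic exploration does not change which joint action the agents consider best. In practice the parameters `w_raw` and `b` would be learned by minimizing the loss rather than fixed by hand as they are here.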

Theorems & Definitions (5)

  • Definition 1
  • Theorem 1
  • Lemma 1: Joint Soft Policy Improvement
  • Theorem 2
  • Theorem 3: RMPI Convergence