ME-IGM: Individual-Global-Max in Maximum Entropy Multi-Agent Reinforcement Learning
Wen-Tse Chen, Yuxuan Li, Shiyu Huang, Jiayu Chen, Jeff Schneider
TL;DR
ME-IGM addresses a fundamental misalignment in applying maximum entropy to IGM-based cooperative MARL by introducing an order-preserving transformation (OPT) that maps local Q-values to policy logits without breaking the IGM structure. The method enables entropy-driven exploration while ensuring that each agent's locally preferred action stays consistent with the joint action that maximizes the global Q-value. The OPT is trained with a KL-based or squared-error objective, within a centralized training with decentralized execution (CTDE) architecture that uses a hypernetwork-based mixer. Empirical results across matrix games, Overcooked, and SMAC-v2 demonstrate state-of-the-art performance and stronger coordination, with ablations confirming the necessity of OPT for monotonic policy improvement and effective credit assignment. The approach combines maximum entropy exploration with IGM-compliant credit assignment, particularly in discrete-action domains. Future work aims to extend ME-IGM to continuous action spaces and broader environments.
Abstract
Multi-agent credit assignment is a fundamental challenge for cooperative multi-agent reinforcement learning (MARL), where a team of agents learns from shared reward signals. The Individual-Global-Max (IGM) condition is a widely used principle for multi-agent credit assignment, requiring that the joint action determined by individual Q-functions maximizes the global Q-value. Meanwhile, the principle of maximum entropy has been leveraged to enhance exploration in MARL. However, we identify a critical limitation in existing maximum entropy MARL methods: a misalignment arises between local policies and the joint policy that maximizes the global Q-value, leading to violations of the IGM condition. To address this misalignment, we propose an order-preserving transformation. Building on it, we introduce ME-IGM, a novel maximum entropy MARL algorithm that is compatible with any credit assignment mechanism satisfying the IGM condition while enjoying the benefits of maximum entropy exploration. We empirically evaluate two variants of ME-IGM, ME-QMIX and ME-QPLEX, in non-monotonic matrix games, and demonstrate their state-of-the-art performance across 17 scenarios in SMAC-v2 and Overcooked.
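To make the IGM condition and the role of an order-preserving transformation concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy two-agent setting with hand-picked local Q-values, a hypothetical QMIX-style mixer (a non-negative weighted sum), and a fixed strictly increasing affine map standing in for the OPT. In ME-IGM the OPT is learned, so the affine form, the function names, and all numbers here are assumptions chosen only for illustration.

```python
import itertools
import math


def greedy_joint_action(local_qs):
    """Each agent independently picks the argmax of its own local Q-values."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in local_qs)


def monotonic_mixer(local_qs, joint_action, weights, bias=0.0):
    """Hypothetical QMIX-style mixer: non-negative weighted sum of the chosen local Qs."""
    assert all(w >= 0 for w in weights)
    return bias + sum(w * q[a] for w, q, a in zip(weights, local_qs, joint_action))


def satisfies_igm(local_qs, weights):
    """Brute-force IGM check: does the decentralized greedy action maximize Q_tot?"""
    joint_actions = itertools.product(*(range(len(q)) for q in local_qs))
    best = max(joint_actions, key=lambda a: monotonic_mixer(local_qs, a, weights))
    return best == greedy_joint_action(local_qs)


def order_preserving_logits(q, scale, shift):
    """Stand-in for the OPT (assumption): a strictly increasing affine map, scale > 0."""
    assert scale > 0
    return [scale * v + shift for v in q]


def softmax(logits):
    """Numerically stable softmax, giving a stochastic (entropy-friendly) policy."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy local Q-values for two agents with three actions each (made-up numbers).
local_qs = [[1.0, 3.0, 2.0], [0.5, -1.0, 4.0]]
weights = [0.7, 1.3]

# Stochastic local policies built from order-preserving logits.
policies = [softmax(order_preserving_logits(q, scale=0.5, shift=-2.0)) for q in local_qs]

# The most likely action under each policy matches the local greedy action.
modes = tuple(max(range(len(p)), key=p.__getitem__) for p in policies)
```

Because the map from Q-values to logits is strictly increasing, the ranking of actions is unchanged, so each policy's most probable action coincides with the argmax of the local Q-function while the softmax still assigns probability to other actions for exploration. A non-monotone map from Q-values to logits could reorder actions and break this agreement, which is the kind of misalignment the abstract describes.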
