Improving Global Parameter-sharing in Physically Heterogeneous Multi-agent Reinforcement Learning with Unified Action Space

Xiaoyang Yu, Youfang Lin, Shuo Wang, Kai Lv, Sheng Han

TL;DR

The paper tackles coordination challenges in physically heterogeneous MARL where naive global parameter-sharing hampers cooperation due to differing action semantics. It introduces the Unified Action Space ($A^U$) and Cross-Group Inverse loss ($L_{CGI}$) to enable shared parameters while preserving semantic distinctions, and demonstrates their effectiveness with U-MAPPO and U-QMIX on SMAC. The key contributions are the semantic-aware sharing framework, the CGI training objective, and two practical algorithms that outperform state-of-the-art baselines in both value-based and policy-based MARL under heterogeneity. This approach improves learning efficiency and coordination in complex multi-agent settings, with potential for broader adoption in heterogeneous domains and future work on other heterogeneity types and communication-enabled setups.

Abstract

In a multi-agent system (MAS), action semantics describe the different influences that agents' actions have on other entities, and can be used to divide agents into groups in a physically heterogeneous MAS. Previous multi-agent reinforcement learning (MARL) algorithms apply global parameter-sharing across different types of heterogeneous agents without carefully distinguishing their action semantics. This common implementation reduces cooperation and coordination between agents in complex situations. However, fully independent agent parameters dramatically increase the computational cost and training difficulty. To benefit from different action semantics while maintaining a proper parameter-sharing structure, we introduce the Unified Action Space (UAS). The UAS is the union of all agent actions with different semantics. All agents first compute their unified representation in the UAS and then generate their heterogeneous action policies using different available-action-masks. To further improve the training of the extra UAS parameters, we introduce a Cross-Group Inverse (CGI) loss that predicts the policies of agents in other groups from trajectory information. As a universal method for physically heterogeneous MARL problems, we add the UAS to both value-based and policy-based MARL algorithms and propose two practical algorithms: U-QMIX and U-MAPPO. Experimental results in the SMAC environment demonstrate the effectiveness of both U-QMIX and U-MAPPO compared with several state-of-the-art MARL methods.
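
The masking step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the group names, the action names, the fixed logits, and the helper functions `action_mask` and `masked_policy` do not come from the paper. The sketch only shows the general idea of building the UAS as a union of per-group action sets and restricting one shared output to each group's available actions.

```python
import math

# Hypothetical action sets for two agent groups (illustrative names only).
GROUP_ACTIONS = {
    "attacker": ["noop", "move", "attack"],
    "healer": ["noop", "move", "heal"],
}

# The UAS is the union of all group action sets, kept in a fixed order.
UAS = sorted({a for acts in GROUP_ACTIONS.values() for a in acts})


def action_mask(group):
    """Return a 0/1 mask over the UAS marking the actions this group can take."""
    allowed = set(GROUP_ACTIONS[group])
    return [1 if a in allowed else 0 for a in UAS]


def masked_policy(uas_logits, mask):
    """Softmax over the UAS logits, restricted to the actions allowed by the mask."""
    valid = [l for l, m in zip(uas_logits, mask) if m]
    peak = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(uas_logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# A shared network would emit one logit per UAS action; fixed numbers stand in here.
logits = [0.5, 1.0, 0.2, -0.3]
for group in GROUP_ACTIONS:
    probs = masked_policy(logits, action_mask(group))
    print(group, dict(zip(UAS, [round(p, 3) for p in probs])))
```

Both groups read the same logits, so the parameters that produce them can be shared, while each group's mask assigns zero probability to actions outside its own action set.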

Paper Structure (23 sections, 1 theorem, 13 equations, 7 figures, 5 tables, 2 algorithms)

This paper contains 23 sections, 1 theorem, 13 equations, 7 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

Consider a fully cooperative, physically heterogeneous MARL problem. Let there be 2 groups of agents, denoted $G_0$ and $G_1$. Both groups have $N$ agents. The actual action space of $G_0$ is $A_0 = \{ a_0, ..., a_m\}$, while the actual action space of $G_1$ is $A_1 = \{ a_0, ..., a_m, ..., a_{\cdots}$ […] Let $J^*$ be the optimal joint reward, and $J^*_{\rho}$ be the optimal joint reward under the param […]

Figures (7)

  • Figure 1: An illustration of the usage of UAS.
  • Figure 2: An overall framework of U-MAPPO. The overall network consists of three parts: actor network $\boldsymbol{\theta}$, predictor network $\boldsymbol{\psi}$ and critic network $\boldsymbol{\phi}$. The actor network $\boldsymbol{\theta}$ and the predictor network $\boldsymbol{\psi}$ are shared by all agents. They take their corresponding inputs and generate the UAS policies $\boldsymbol{\pi}_{uni}$. Then the $\boldsymbol{\pi}_{uni}$ are masked by different available-action-masks $AM$ to generate the joint action policy $\pi_{\boldsymbol{\theta}}$ and the joint inverse policy $\rho^{inv}_{\boldsymbol{\psi}}$ separately. The critic network $\boldsymbol{\phi}$ is a global network to compute the global value function $V_{\boldsymbol{\phi}}(s)$ during training.
  • Figure 3: An overall framework of U-QMIX. The overall network consists of two parts: Q network $\boldsymbol{\theta}$ and predictor network $\boldsymbol{\psi}$. The Q network includes the local Q network $\boldsymbol{\theta}_i$ and the mixing network $\boldsymbol{\theta}_M$. The local Q network $\boldsymbol{\theta}_i$ and the predictor network $\boldsymbol{\psi}$ are shared by all agents. They take their corresponding inputs and generate the UAS Q values $\boldsymbol{Q}_{uni}$. Then the $\boldsymbol{Q}_{uni}$ are masked by different available-action-masks $AM$ to generate different Q values for calculating different losses. The mixing network $\boldsymbol{\theta}_M$ is a global network to compute the $Q_{tot}$ during training.
  • Figure 4: Results of Value-based Algorithms Comparison.
  • Figure 5: Results of Policy-based Algorithms Comparison.
  • ...and 2 more figures
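
The captions of Figures 2 and 3 describe a shared predictor network $\boldsymbol{\psi}$ whose UAS output is masked to obtain an inverse policy. A rough sketch of how such a cross-group prediction loss could look is given below. This is an assumption, not the paper's definition: the exact form of $L_{CGI}$ is not stated in this section, so a plain negative log-likelihood of another group's observed action stands in for it, and the UAS, masks, logits, and function names are all hypothetical.

```python
import math

# Hypothetical unified action space and per-group available-action-masks.
UAS = ["attack", "heal", "move", "noop"]
MASKS = {
    "attacker": [1, 0, 1, 1],
    "healer": [0, 1, 1, 1],
}


def masked_log_softmax(logits, mask):
    """Log-softmax over the allowed entries; masked-out entries get -inf."""
    valid = [l for l, m in zip(logits, mask) if m]
    peak = max(valid)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in valid))
    return [l - log_z if m else float("-inf") for l, m in zip(logits, mask)]


def cross_group_loss(predictor_logits, other_group, observed_action):
    """Stand-in loss: negative log-likelihood of the other group's observed action.

    The predictor output lives in the UAS and is masked with the *other*
    group's mask, so only actions that group can actually take are scored.
    """
    log_probs = masked_log_softmax(predictor_logits, MASKS[other_group])
    return -log_probs[UAS.index(observed_action)]


# Fixed numbers stand in for the predictor network's UAS output.
pred_logits = [0.1, 2.0, 0.3, -0.5]
print(round(cross_group_loss(pred_logits, "healer", "heal"), 4))
```

Because the predictor scores actions in the UAS rather than in one group's own action set, the same shared output can be trained against agents of any group by swapping the mask.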

Theorems & Definitions (2)

  • Proposition 1
  • Definition 1