GHQ: Grouped Hybrid Q Learning for Heterogeneous Cooperative Multi-agent Reinforcement Learning

Xiaoyang Yu, Youfang Lin, Xiangsen Wang, Sheng Han, Kai Lv

TL;DR

This work formalizes Local Transition Heterogeneity (LTH) in cooperative heterogeneous MARL and introduces Grouped Hybrid Q-Learning (GHQ) to address it. GHQ combines Grouped IGM Consistency (GIGM), Ideal Object Grouping (IOG), and Inter-Group Mutual Information (IGMI) to coordinate multiple agent groups with group-specific networks and a hybrid value-factorization. Empirical results on original and newly designed asymmetric SMAC maps show GHQ outperforms state-of-the-art baselines, with reduced variance and clearer inter-group coordination, highlighting the importance of heterogeneity-aware grouping. The approach offers a principled framework for tackling LTH in MARL and suggests avenues for scaling to larger, more complex heterogeneous environments.

Abstract

Previous deep multi-agent reinforcement learning (MARL) algorithms have achieved impressive results, typically in homogeneous scenarios. However, heterogeneous scenarios are also very common and usually harder to solve. In this paper, we mainly discuss cooperative heterogeneous MARL problems in the StarCraft Multi-Agent Challenge (SMAC) environment. We first define and describe the heterogeneous problems in SMAC. To reveal and study the problem comprehensively, we add new maps to the original SMAC maps. We find that baseline algorithms fail to perform well on those heterogeneous maps. To address this issue, we propose the Grouped Individual-Global-Max Consistency (GIGM) and a novel MARL algorithm, Grouped Hybrid Q Learning (GHQ). GHQ separates agents into several groups and keeps individual parameters for each group, along with a novel hybrid structure for factorization. To enhance coordination between groups, we maximize the Inter-group Mutual Information (IGMI) between groups' trajectories. Experiments on original and new heterogeneous maps show that GHQ outperforms other state-of-the-art algorithms.

Paper Structure

This paper contains 29 sections, 1 theorem, 19 equations, 8 figures, 7 tables, and 1 algorithm.

Key Result

Theorem 1

Joint Trajectory Condition (JTC): GIGM holds if the following two conditions are both satisfied: (i) the global joint trajectory is equivalent to the union of all group trajectories; (ii) the intersection of all group trajectories is empty.
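
The two conditions of the JTC can be read as a set-level check on a candidate grouping. The sketch below is only an illustration and is not taken from the paper: it assumes each group trajectory can be represented as a set of hashable items (here, agent indices standing in for per-agent trajectories), and the function name `satisfies_jtc` and the group names are hypothetical. It implements condition (ii) literally, as the intersection over all groups, and is meant for two or more groups.

```python
from typing import Dict, Hashable, Set


def satisfies_jtc(global_traj: Set[Hashable],
                  group_trajs: Dict[str, Set[Hashable]]) -> bool:
    """Check the two JTC conditions on set-valued trajectories.

    Assumption: a trajectory is modelled as a set of hashable items
    (e.g. agent indices); this representation is illustrative only.
    """
    if not group_trajs:
        return False
    groups = list(group_trajs.values())
    # (i) the global joint trajectory equals the union of all group trajectories
    covers_everything = set().union(*groups) == global_traj
    # (ii) the intersection of all group trajectories is empty
    no_common_part = not set.intersection(*groups)
    return covers_everything and no_common_part


# Hypothetical grouping of four agents into two groups.
valid = {"group_a": {0, 1}, "group_b": {2, 3}}
overlapping = {"group_a": {0, 1, 2}, "group_b": {2, 3}}
incomplete = {"group_a": {0, 1}, "group_b": {2}}

print(satisfies_jtc({0, 1, 2, 3}, valid))
print(satisfies_jtc({0, 1, 2, 3}, overlapping))
print(satisfies_jtc({0, 1, 2, 3}, incomplete))
```

With more than two groups, a pairwise-disjointness check would be a stricter variant of condition (ii). Whether the theorem intends that stronger reading is not stated here.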

Figures (8)

  • Figure 1: The overall framework of GHQ. The parameters $\boldsymbol{\theta}_{\mathcal{G}_m}$ of group $m$ consist of three parts: the agent network $\theta_i$, the mixing network $\theta_{M_m}$, and the inference network $\boldsymbol{\psi}_{\mathcal{G}_m}$. The detailed data stream for training and execution is shown in Fig. 2. In (a), $\theta_i$ takes $o_i^t$ and $a_i^{t-1}$ as input and generates $Q_i$ for choosing actions, as well as $l_{\mathcal{G}_m}$ and $h_{\mathcal{G}_m}$ for calculating $q_{\mathcal{G}_m}$. In (b), $\theta_{M_m}$ takes $Q$ and $s$ to calculate the TD loss $\mathcal{L}_{TD_m}$ with hybrid factorization. In (c), $\boldsymbol{\psi}_{\mathcal{G}_m}$ takes $l_{\mathcal{G}_m}$, $h_{\mathcal{G}_m}$, and $l_{\mathcal{G}_n}$ to calculate the IGMI loss $\mathcal{L}_{MI_m}$.
  • Figure 2: An overview of the data stream of GHQ. During decentralized execution, agent networks $\theta_i$ and $\theta_j$ generate $Q_i^t$ and $Q_j^t$ for choosing actions $a_i^t$ and $a_j^t$, respectively. The input of $\theta_i$ is the local observation $o_i^t$ and last action $a_i^{t-1}$ of agent $i$ in group $\mathcal{G}_m$. All necessary transition tuples $(\boldsymbol{\tau}, s, r)$ are stored in the replay buffer $\mathcal{D}$. During centralized training, a batch of trajectories $\boldsymbol{\tau}_{\mathcal{G}_m}$ is sampled from $\mathcal{D}$ as the input of $\theta_i$ for calculating $\boldsymbol{Q}_{\mathcal{G}_m}(\boldsymbol{\tau}_{\mathcal{G}_m})$. Then, the mixing network $\theta_{M_m}$ takes $\boldsymbol{Q}_{\mathcal{G}_m}(\boldsymbol{\tau}_{\mathcal{G}_m})$ and the state $s$ to calculate $\boldsymbol{Q}_{\mathcal{G}_m}(\boldsymbol{\tau}_{\mathcal{G}_m}, \boldsymbol{s})$ and the TD loss $\mathcal{L}_{TD_m}$. The GRU hidden states $h_{\mathcal{G}_m}$, $h_{\mathcal{G}_n}$ and the Gaussian distributions $l_{\mathcal{G}_m}$, $l_{\mathcal{G}_n}$ are generated by agent networks $\theta_i$ and $\theta_j$, and are used to calculate the IGMI losses $\mathcal{L}_{MI_m}$ and $\mathcal{L}_{MI_n}$. Detailed formulas are given in the paper's section on the IGMI loss.
  • Figure 3: Examples of SMAC maps. The lower two are ours.
  • Figure 4: Comparison results for value-based algorithms.
  • Figure 5: Heat maps of the health-point percentage of $U_{spt}$ for GHQ and QMIX-FT on 6m2m_15m and 6m2m_16m.
  • ...and 3 more figures
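
The data stream described in the Figure 2 caption (greedy action choice from $Q_i$, storage of $(\boldsymbol{\tau}, s, r)$ tuples in $\mathcal{D}$, and sampling of group trajectories $\boldsymbol{\tau}_{\mathcal{G}_m}$ for training) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `GroupedReplayBuffer`, the helper `greedy_action`, the episode dictionary layout, and the group names are all hypothetical, and the networks, mixing step, and losses are omitted entirely.

```python
import random
from collections import deque
from typing import Any, Dict, List, Sequence


def greedy_action(q_values: Sequence[float]) -> int:
    """Pick the index of the largest Q-value (first index wins ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


class GroupedReplayBuffer:
    """Stores whole episodes and returns per-group slices of the trajectories.

    Assumption: an episode is a dict with keys "tau" (agent id -> trajectory),
    "s" (global states) and "r" (shared rewards). This layout is illustrative.
    """

    def __init__(self, capacity: int, groups: Dict[str, List[int]]):
        self.groups = groups
        self.episodes: deque = deque(maxlen=capacity)

    def store(self, episode: Dict[str, Any]) -> None:
        self.episodes.append(episode)

    def sample_group(self, group: str, batch_size: int,
                     rng: random.Random) -> List[Dict[str, Any]]:
        """Sample episodes and keep only the trajectories of one group."""
        k = min(batch_size, len(self.episodes))
        batch = rng.sample(list(self.episodes), k)
        ids = self.groups[group]
        return [
            {"tau": {i: ep["tau"][i] for i in ids}, "s": ep["s"], "r": ep["r"]}
            for ep in batch
        ]


# Hypothetical setup: agents 0-1 form one group, agent 2 another.
buffer = GroupedReplayBuffer(capacity=10,
                             groups={"group_m": [0, 1], "group_n": [2]})
for episode_id in range(3):
    buffer.store({
        "tau": {agent: [("obs", episode_id, agent)] for agent in range(3)},
        "s": [episode_id],
        "r": [0.0],
    })

batch_m = buffer.sample_group("group_m", batch_size=2, rng=random.Random(0))
print(len(batch_m), sorted(batch_m[0]["tau"].keys()))
print(greedy_action([0.1, 0.9, 0.3]))
```

Keeping the full joint episode in one buffer and slicing per group at sampling time is one possible design. It keeps the state $s$ and reward $r$ shared across groups, as the caption suggests, but the paper may organize the buffer differently.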

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Theorem 1