Dual Ensembled Multiagent Q-Learning with Hypernet Regularizer

Yaodong Yang, Guangyong Chen, Hongyao Tang, Furui Liu, Danruo Deng, Pheng Ann Heng

TL;DR

DEMAR tackles multiagent overestimation in value-mixing Q-learning by linking an iterative estimation-optimization analysis to a practical algorithm. It introduces dual ensembles to produce lower, more reliable target estimates for both individual and global Q-values and couples this with a hypernet regularizer to curb the accumulation of bias during online optimization. Theoretical analysis shows how overestimation arises from target estimation and gradient propagation, and DEMAR provides explicit mechanisms to bound both sources. Empirical results on MPE and noisy SMAC show DEMAR stabilizes training and reduces overestimation, with demonstrated generality when extended to other MARL methods.
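The dual-ensemble idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array shapes, the order of the min and max operations, and the use of plain callables as mixers are all assumptions made for illustration. The sketch takes the minimum over a random subset of ensemble members twice, once for the individual target Q-values and once for the global target Q-value.

```python
import numpy as np

def ensemble_min(values, subset_size, rng):
    """Minimum over a random subset of ensemble members (axis 0)."""
    k = values.shape[0]
    idx = rng.choice(k, size=subset_size, replace=False)
    return values[idx].min(axis=0)

def dual_ensemble_target(q_next, mixers, reward, gamma, subset_size, rng):
    """Hypothetical dual-ensemble TD target.

    q_next : array (K, N, A), next-state individual Q-values from K target
             networks for N agents and A actions (assumed layout).
    mixers : list of callables mapping an (N,) vector of individual values
             to a scalar global Q-value (assumed interface).
    """
    # First ensemble: lower estimate of each agent's Q-values, then greedy action.
    q_low = ensemble_min(q_next, subset_size, rng)   # (N, A)
    q_i = q_low.max(axis=1)                          # (N,)
    # Second ensemble: lower estimate of the global Q-value across mixers.
    q_tot = np.array([m(q_i) for m in mixers])       # (number of mixers,)
    q_tot_low = ensemble_min(q_tot, min(subset_size, len(mixers)), rng)
    return reward + gamma * q_tot_low
```

With monotonic mixers (non-negative weights), the resulting target is never larger than the target built from any single selected member, which is the sense in which the ensemble yields a lower update target.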

Abstract

Overestimation in single-agent reinforcement learning has been extensively studied. In contrast, overestimation in the multiagent setting has received comparatively little attention, although it increases with the number of agents and leads to severe learning instability. Previous works concentrate on reducing overestimation when estimating the target Q-value. They ignore the subsequent optimization of the online Q-network, which makes it hard to fully address the complex multiagent overestimation problem. To address this challenge, we first establish an iterative estimation-optimization analysis framework for multiagent value-mixing Q-learning. Our analysis reveals that multiagent overestimation not only comes from the computation of the target Q-value but also accumulates during the optimization of the online Q-network. Motivated by this finding, we propose the Dual Ensembled Multiagent Q-Learning with Hypernet Regularizer algorithm to tackle multiagent overestimation from two aspects. First, we extend the random ensemble technique to the estimation of the target individual and global Q-values to derive a lower update target. Second, we propose a novel hypernet regularizer on hypernetwork weights and biases that constrains the optimization of the online global Q-network to prevent overestimation from accumulating. Extensive experiments in MPE and SMAC show that the proposed method successfully addresses overestimation across various tasks.
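A regularizer on hypernetwork weights and biases can be sketched as follows. This is a hedged illustration, not the paper's formulation: the QMIX-style mixer with single linear hypernetwork layers, the ReLU activation, the L1-style penalty, the coefficient `lam`, and all parameter names are assumptions, since the abstract does not state the exact form.

```python
import numpy as np

def hyper_outputs(state, params, n_agents, hidden):
    """Generate mixer weights and biases from the state (assumed linear hypernets)."""
    w1 = np.abs(state @ params["W_w1"]).reshape(n_agents, hidden)  # non-negative
    b1 = state @ params["W_b1"]                                    # (hidden,)
    w2 = np.abs(state @ params["W_w2"])                            # non-negative
    b2 = float(state @ params["W_b2"])
    return w1, b1, w2, b2

def mix(q, w1, b1, w2, b2):
    """Monotonic mixing of individual Q-values into a global Q-value."""
    h = np.maximum(q @ w1 + b1, 0.0)  # ReLU used here for simplicity
    return float(h @ w2 + b2)

def hypernet_penalty(w1, b1, w2, b2):
    """Assumed L1-style penalty on the generated weights and biases."""
    return float(np.abs(w1).mean() + np.abs(b1).mean()
                 + np.abs(w2).mean() + abs(b2))

def regularized_loss(q_tot, target, penalty, lam):
    """Squared TD error plus the weighted hypernet penalty."""
    return (q_tot - target) ** 2 + lam * penalty
```

Because the mixer weights determine how strongly a change in each individual Q-value moves the global Q-value, shrinking them is one plausible way to limit how much estimation error propagates through the online global Q-network.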

Paper Structure

This paper contains 32 sections, 2 theorems, 28 equations, 12 figures, 2 tables.

Key Result

Lemma 2.1

Let $Q_{tot}$ be a function of $s$ and $Q_{i}$ for $i=1,2,...,N$ where $Q_{i}$ is a function of $s$ and $a_{i}$ for $a_{i} \in A_{i}$. Assuming $l \leq \frac{\partial Q_{tot}}{\partial Q_{i}} \leq L, i=1,2,...,N$ where $l \geq 0$, $L > 0$, and $Q_{i}(s,a_{i})$ is with an independent noise uniformly where $\mathbf{Q}(s', \mathbf{a}'_{i})= (Q_{1}(s',a'_{1}), ..., Q_{N}(s',a'_{N}))$ are individual Q

Figures (12)

  • Figure 1: The framework of DEMAR. The left part involves the dual ensembled multiagent Q-learning while the right part shows the hypernet regularizer on the global Q-network.
  • Figure 2: Results on different MPE scenarios. Subfigures (tag) through (adversary) show the learning performance of each method on MPE tasks. Subfigures (tagoverestimation) through (adversaryoverestimation) show the estimated global Q-value of each method, in log scale, on MPE tasks.
  • Figure 3: Results on different noisy SMAC scenarios. Subfigures (5m6m) through (10m11m) show the learning performance of each method on SMAC tasks. Subfigures (5m6moverestimation) through (10m11moverestimation) show the estimated global Q-value of each method, in log scale, on SMAC tasks.
  • Figure 4: Results of the ablation study on simple_tag and 5m_vs_6m. The w/o ensemble indicates DEMAR without the dual ensembled Q-learning. The w/o regularizer represents DEMAR without the hypernet regularizer.
  • Figure 5: Results of analyzed overestimation terms including $Q_{tot}$, $Q_{i}$, and $\frac{\partial Q_{tot}}{\partial Q_{i}}$ on both simple_adversary and 5m_vs_6m.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Lemma 2.1 (Gan et al., 2021)
  • Theorem 3.1