QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu

TL;DR

QSIM is a similarity-weighted Q-learning framework that reconstructs the TD target using action similarity, significantly mitigating systematic value overestimation in MARL.

Abstract

Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
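The target construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). Several details are assumptions made here for illustration: near-greedy joint actions are taken to be the greedy joint action with exactly one agent's action replaced, the greedy joint action itself is included as a candidate, a single softmax is applied over all candidates, and the names `similarity_weighted_target`, `embeddings`, `q_tot`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-8)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_weighted_target(reward, gamma, greedy, embeddings, q_tot,
                               temperature=1.0):
    """Sketch of a similarity-weighted TD target.

    greedy      -- tuple of greedy next-state action indices, one per agent
    embeddings  -- embeddings[a] is the representation of action a
                   (assumed shared across agents and already learned)
    q_tot       -- callable mapping a joint-action tuple to the target
                   network's joint Q-value at the next state
    """
    # Assumption: the greedy joint action is itself a candidate (similarity 1).
    candidates = [tuple(greedy)]
    similarities = [1.0]

    # Assumption: each near-greedy joint action changes one agent's action only.
    for agent, a_star in enumerate(greedy):
        for action in range(len(embeddings)):
            if action == a_star:
                continue
            joint = list(greedy)
            joint[agent] = action
            candidates.append(tuple(joint))
            similarities.append(cosine(embeddings[action], embeddings[a_star]))

    # Actions more similar to the greedy choice receive larger weights.
    weights = softmax(similarities, temperature)
    expected_q = sum(w * q_tot(c) for w, c in zip(weights, candidates))
    return reward + gamma * expected_q
```

Because every weight is positive and the weights sum to one, the resulting target is a convex combination of candidate values, so it can never exceed the standard max-based target when the greedy joint action is the maximizer among the candidates.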

Paper Structure (46 sections, 37 equations, 15 figures, 7 tables, 1 algorithm)

Figures (15)

  • Figure 1: Theoretical curves showing how the upper bound of overestimation bias scales with the number of agents $N$. Further details are given in Theorem 1.
  • Figure 2: QSIM framework. (a) Autoencoder: Self-supervised learning of action representations. (b) Action Similarity: Computing cosine similarity between deviating action $a^j_i$ and greedy action $a^*_i$ to derive softmax-normalized weights. (c) Weighted TD Target: Constructing near-greedy joint actions $\bm{c}^j_i$ and aggregating their TD target into the final weighted TD target $Y_{\text{QSIM}}$.
  • Figure 3: Performance comparison on SMAC maps.
  • Figure 4: Comparison of QSIM-enhanced variants with their original baselines across different benchmarks.
  • Figure 5: Ablation study comparing the full QSIM-QMIX model against the unweighted QSIM-Mean variant and the original QMIX baseline.
  • ...and 10 more figures
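
The trend summarized in the Figure 1 caption, overestimation growing with the number of agents $N$, can be illustrated with a toy simulation. This is not Theorem 1 and does not reproduce the paper's bound. It assumes every joint action has a true value of zero and that each joint-action estimate carries independent zero-mean Gaussian noise, which is a simplification: under value decomposition, errors on different joint actions are generally correlated. The function name `mean_max_bias` and its parameters are hypothetical.

```python
import random


def mean_max_bias(n_agents, n_actions=2, noise_std=1.0, trials=2000, seed=0):
    """Average value of the max over noisy joint-action estimates.

    All true joint Q-values are zero, so the true maximum is zero and the
    returned average is exactly the bias introduced by the max operator.
    """
    rng = random.Random(seed)
    n_joint = n_actions ** n_agents  # joint action space grows exponentially in N
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, noise_std) for _ in range(n_joint))
    return total / trials


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, round(mean_max_bias(n), 3))
```

Under these assumptions the bias is positive for every $N$ and increases as agents are added, since the max is taken over exponentially more noisy estimates.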