Optimistic ε-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning

Ruoning Zhang, Siying Wang, Wenyu Chen, Yang Zhou, Zhitong Zhao, Zixuan Zhang, Ruijie Zhang

TL;DR

This paper tackles underestimation in centralized training with decentralized execution caused by monotonic value decomposition. It introduces Optimistic $\epsilon$-Greedy Exploration, anchored by an optimistic updating network, which biases exploration toward sampling optimal actions and strengthens value estimation. The authors prove convergence in probability of the optimistic updates to the maximum reward component and integrate the approach into QMIX, yielding OPT-QMIX. Empirically, OPT-QMIX improves performance and avoids suboptimal convergence across Matrix Game, Predator-Prey, and StarCraft II SMAC scenarios, highlighting the practical impact of guided optimistic exploration in cooperative MARL.

Abstract

The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, due to the representational limitations of traditional monotonic value decomposition methods, algorithms can underestimate optimal actions, leading policies to suboptimal solutions. To address this challenge, we propose Optimistic $\epsilon$-Greedy Exploration, which focuses on enhancing exploration to correct value estimations. Our analysis indicates that the underestimation arises from insufficient sampling of optimal actions during exploration. We introduce an optimistic updating network to identify optimal actions and, during exploration, sample actions from its distribution with probability $\epsilon$, increasing the selection frequency of optimal actions. Experimental results in various environments show that Optimistic $\epsilon$-Greedy Exploration effectively prevents the algorithm from converging to suboptimal solutions and significantly improves its performance compared to other algorithms.
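
The exploration rule described in the abstract can be sketched for a single agent as follows. This is a minimal illustration, not the authors' implementation: the function names, the softmax form of the optimistic network's action distribution, and the plain-list inputs are all assumptions made for the example. With probability $\epsilon$ the agent samples from the distribution induced by the optimistic estimates; otherwise it acts greedily on its ordinary utility estimates.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Turn a list of scores into a probability distribution (numerically stable)."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def optimistic_epsilon_greedy(q_values, opt_values, epsilon, rng=random):
    """Pick an action index for one agent.

    q_values   : per-action utilities from the agent's usual network (hypothetical input)
    opt_values : per-action estimates from the optimistic network (hypothetical input)
    epsilon    : probability of sampling from the optimistic distribution
    """
    if rng.random() < epsilon:
        # Exploration step: sample from the optimistic network's distribution
        # (softmax is an assumed choice) instead of sampling uniformly.
        probs = softmax(opt_values)
        return rng.choices(range(len(opt_values)), weights=probs, k=1)[0]
    # Exploitation step: greedy action under the ordinary utilities.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Compared with standard $\epsilon$-greedy, the only change in this sketch is the exploration branch: actions that the optimistic estimates rate highly are drawn more often than under uniform sampling.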

Paper Structure

This paper contains 17 sections, 7 theorems, 26 equations, 3 figures, 2 tables.

Key Result

Lemma 1

For the sequence $\{f_t(x)\}$ defined in Eq. (6), if initialized with $f_0(x)\leq r_{max}$, where $r_{max}=\max_t r_t$, then $\forall t\geq0$, it holds that $f_t(x)\leq r_{max}$.
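
The boundedness claim in Lemma 1 can be checked numerically on a toy example. Eq. (6) is not reproduced in this summary, so the update rule below is an assumed stand-in (the estimate moves toward a reward only when that reward exceeds it), and the step size `alpha`, the reward values, and the function names are hypothetical. The sketch only illustrates the kind of invariant the lemma states; it is not the paper's proof.

```python
import random


def optimistic_update(f, r, alpha=0.1):
    """Assumed optimistic rule: increase f toward r only if r is larger than f."""
    return f + alpha * max(r - f, 0.0)


def run(rewards, f0=0.0, alpha=0.1):
    """Apply the assumed update over a reward sequence and record every iterate."""
    f = f0
    history = [f]
    for r in rewards:
        f = optimistic_update(f, r, alpha)
        history.append(f)
    return history


# Hypothetical noisy rewards observed for one fixed input x.
rng = random.Random(0)
rewards = [rng.choice([-12.0, 0.0, 8.0]) for _ in range(500)]
history = run(rewards, f0=0.0, alpha=0.1)
r_max = max(rewards)

# Invariant in the spirit of Lemma 1: starting from f0 <= r_max,
# no iterate exceeds r_max (up to floating-point tolerance).
bounded = all(f <= r_max + 1e-9 for f in history)
```

Under this assumed rule with `alpha` in (0, 1], each update is a convex combination of the current estimate and a reward no larger than $r_{max}$, which is why the bound is preserved at every step.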

Figures (3)

  • Figure 1: Overall framework of Optimistic $\epsilon$-Greedy Exploration
  • Figure 2: Experimental Results of Matrix Game and Predator-Prey
  • Figure 3: Experimental Results of StarCraft Multi-Agent Challenge

Theorems & Definitions (13)

  • Definition 1
  • Lemma 1
  • Lemma 2
  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 3 more