Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Yonghyeon Jo; Sunwoo Lee; Seungyul Han

TL;DR

Successive Sub-value Q-learning (S2Q) learns multiple sub-value functions to retain alternative high-value actions and incorporates them into a Softmax-based behavior policy, which encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to changing optima.

Abstract

Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.
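The Softmax-based behavior policy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each sub-value function proposes one candidate joint action and that a scalar value estimate is available for each candidate. The function names, the temperature parameter, and the example numbers are hypothetical and are not taken from the paper or its repository.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_behavior_action(candidates, q_values, temperature=1.0, rng=random):
    """Sample one candidate with probability softmax(q / temperature).

    candidates: joint actions proposed by the sub-value functions (assumed input).
    q_values:   one value estimate per candidate (assumed input).
    """
    probs = softmax(q_values, temperature)
    idx = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    return candidates[idx], probs


# Hypothetical example: three candidate joint actions for two agents.
candidates = [(0, 0), (1, 1), (2, 2)]
q_values = [8.0, 6.0, 5.5]
action, probs = sample_behavior_action(
    candidates, q_values, temperature=2.0, rng=random.Random(0)
)
```

Because every candidate keeps a nonzero sampling probability, lower-valued alternatives continue to be visited, which is one way to read the "persistent exploration" described above.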

Paper Structure (46 sections, 2 theorems, 21 equations, 22 figures, 11 tables, 2 algorithms)

Introduction
Preliminaries
Decentralized POMDPs
Value Decomposition under CTDE
Communication in Dec-POMDPs
Related works
CTDE Methods in MARL
Overcoming the Monotonicity Constraint
Communication in MARL
Methodology
Motivation: Overcoming Dynamic Optimality Shifts in MARL
Successive Sub-value Q-learning for Retaining Suboptimal Actions
Coordinated Execution via Communication during Training
Experiments
Performance Analysis
...and 31 more sections

Key Result

Theorem 4.1

Let $Q^*(s_t,\boldsymbol{\tau}_t,\mathbf a_t)$ and $\{Q^{\mathrm{sub}}_k\}_{k=0}^K$ be the joint action-value function and the sub-value functions obtained by minimizing Eq. (eq:wqmix) and Eq. (eq:succesiveQ), respectively, and let $\{\mathbf a_{0,t}^*,\dots,\mathbf a_{K,t}^*\}$ denote the $K{+}1$ [...]. If the reward function $r$ is bounded and the suppression factor $\alpha$ is sufficiently large, then [...]

Figures (22)

Figure 1: Fundamental limitations of value decomposition algorithms. (a): The actual payoff of the matrix game. (b),(c): Training results of QMIX and WQMIX. (d): Training results of S2Q when $K=2$, where $K$ is the hyperparameter controlling the number of sub-networks used.
Figure 1: Component evaluation of S2Q on SMAC-Hard+ tasks.
Figure 2: Illustration of the S2Q framework. Each subnetwork $Q^{sub,k}$ transmits $\mathcal{A}^k$, the set of optimal actions according to all previous subnetworks. $Q^{sub,k+1}$ learns the unrestricted target $Q^*$ while suppressing the Q-values of the actions included in $\mathcal{A}^k$.
Figure 3: Overall framework of S2Q
Figure 4: Experiment environments
...and 17 more figures
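
The successive suppression described in the Figure 2 caption can be sketched in a tabular setting. This is only an illustration under stated assumptions: the payoff table stands in for the learned target $Q^*$, suppression is modeled as subtracting a constant `alpha` from previously selected actions (the paper's exact form may differ), and the function name and payoff values are hypothetical.

```python
def successive_top_actions(payoff, num_sub, alpha):
    """Return the greedy joint action of each of num_sub + 1 tabular sub-values.

    payoff:  dict mapping joint action -> value (stand-in for Q*).
    num_sub: K, so that K + 1 sub-values are evaluated.
    alpha:   amount subtracted from already-selected actions (assumed form).
    """
    suppressed = set()   # plays the role of the action set A^k
    selected = []
    for _ in range(num_sub + 1):
        # Sub-value k equals the target except on suppressed actions.
        sub_value = {
            a: (q - alpha if a in suppressed else q) for a, q in payoff.items()
        }
        best = max(sub_value, key=sub_value.get)
        selected.append(best)
        suppressed.add(best)
    return selected


# Hypothetical 3x3 matrix game for two agents.
payoff = {
    (0, 0): 8.0, (0, 1): -12.0, (0, 2): -12.0,
    (1, 0): -12.0, (1, 1): 6.0, (1, 2): 0.0,
    (2, 0): -12.0, (2, 1): 0.0, (2, 2): 5.0,
}
large_alpha = successive_top_actions(payoff, num_sub=2, alpha=100.0)
small_alpha = successive_top_actions(payoff, num_sub=2, alpha=1.0)
```

With a large `alpha`, each successive sub-value moves on to the next-best joint action. With a small `alpha`, the first action still dominates after suppression and is selected again, which is consistent with Theorem 4.1 requiring the suppression factor to be sufficiently large.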

Theorems & Definitions (3)

Theorem 4.1
Theorem B.1: Successive Q-learning
proof
