Optimistic Multi-Agent Policy Gradient

Wenshuai Zhao; Yi Zhao; Zhiyuan Li; Juho Kannala; Joni Pajarinen

TL;DR

This work tackles relative overgeneralization (RO) in cooperative multi-agent policy gradient (MAPG) methods by introducing OptiMAPPO, a simple framework that injects optimism through advantage clipping. Negative advantages are replaced with zero (or, in the Leaky ReLU extension, scaled down), so that updates are optimistic; an operator-theoretic analysis shows that the method still retains optimality at a fixed point. Empirically, OptiMAPPO improves performance on challenging benchmarks (MA-MuJoCo, Overcooked), outperforming strong MAPG baselines (MAPPO, HAPPO, HATRPO) and optimistic Q-learning baselines on 13 of the 19 tested tasks and matching them on the rest. The results suggest that targeted optimism in MAPG can mitigate RO and improve coordination in continuous-action, fully cooperative MARL settings, with practical relevance for robotics and collaborative AI systems.

Abstract

*Relative overgeneralization* (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. No methods have been proposed for addressing RO in multi-agent policy gradient (MAPG) methods although these methods produce state-of-the-art results. To address this gap, we propose a general, yet simple, framework to enable optimistic updates in MAPG methods that alleviate the RO problem. Our approach involves clipping the advantage to eliminate negative values, thereby facilitating optimistic updates in MAPG. The optimism prevents individual agents from quickly converging to a local optimum. Additionally, we provide a formal analysis to show that the proposed method retains optimality at a fixed point. In extensive evaluations on a diverse set of tasks including the *Multi-agent MuJoCo* and *Overcooked* benchmarks, our method outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest.
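The advantage clipping described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the NumPy-based PPO surrogate, and the parameters `eps` and `eta` are assumptions made for this example. Here `eta` is taken to be the slope applied to negative advantages, so `eta=0` corresponds to full clipping and `eta=1` recovers the unmodified advantage.

```python
import numpy as np


def optimistic_advantage(adv, eta=0.0):
    """Leaky-ReLU-style reshaping of advantages (hypothetical helper).

    Non-negative advantages pass through unchanged; negative ones are
    scaled by eta. eta=0 zeroes them out, eta=1 leaves them untouched.
    """
    adv = np.asarray(adv, dtype=np.float64)
    return np.where(adv >= 0.0, adv, eta * adv)


def optimistic_ppo_surrogate(ratio, adv, eps=0.2, eta=0.0):
    """Standard PPO clipped surrogate evaluated on reshaped advantages.

    ratio: probability ratio pi_new(a|s) / pi_old(a|s) per sample.
    eps:   PPO clipping range (assumed value, not taken from the paper).
    """
    ratio = np.asarray(ratio, dtype=np.float64)
    a = optimistic_advantage(adv, eta)
    unclipped = ratio * a
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * a
    return float(np.minimum(unclipped, clipped).mean())
```

With `eta=0`, a sample whose advantage is negative contributes nothing to the objective, so an agent is not pushed away from an action merely because its teammates behaved poorly when it was tried.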

Paper Structure (38 sections, 1 theorem, 16 equations, 10 figures, 7 tables, 1 algorithm)


Introduction
Related work
Classic Optimistic Methods
Optimistic Deep Q-Learning
Multi-Agent Policy Gradient Methods
Multi-Agent Exploration Methods
Optimistic Thompson Sampling
Advantage Shaping
Background
Problem Formulation
Optimistic Q-learning
Method
Optimistic MAPPO
Extension to Leaky ReLU operation
Analysis
...and 23 more sections

Key Result

Proposition 4.1

$\pi(\theta^{\ast})$ is a fixed point of $\mathcal{I}_V^{\text{clip}}\circ \mathcal{P}_V$,

Figures (10)

Figure 1: Left: Payoff matrices of the Climbing and Penalty games. Each game has two agents, which select the row and column index respectively, aiming to find the maximal element of the matrix. Right: Comparison of the learning process with and without optimistic updates on the Climbing task. It shows that the optimistic update is necessary to solve the RO problem.
Figure 2: Comparisons of average episodic returns on three MA-MuJoCo tasks. OptiMAPPO converges to a better joint policy in these tasks. We plot the mean across 5 random seeds, and the shaded areas denote 95% confidence intervals.
Figure 3: Comparisons of average episodic returns on Overcooked tasks. Our method outperforms or matches strong baselines and hysteretic DQN (hy_dqn in the legend) on the tested tasks. Despite being optimistic, hy_dqn fails to achieve good performance.
Figure 4: Ablation experiments on different degrees of optimism in OptiMAPPO. Optimism helps in both tasks across a wide range of degrees. In particular, in HalfCheetah 6x1, the performance gradually improves with decreasing $\eta$, i.e. an increasing degree of optimism.
Figure 5: The top row shows the episode return of hysteretic DQN with different $\alpha$, while the corresponding average Q values are shown in the bottom row. The Q values gradually increase with an increasing degree of optimism, i.e. lower $\alpha$, which may degrade the performance.
...and 5 more figures
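To make the RO problem from Figure 1 concrete, the sketch below evaluates one agent's actions in a Climbing game against a uniformly random partner. The payoff values are an assumption: they are the commonly used Climbing matrix from the cooperative-games literature and may differ from the one plotted in the paper. All function names are hypothetical.

```python
import numpy as np

# Assumed payoffs (standard Climbing game); rows are agent 1's actions,
# columns are agent 2's actions. The joint optimum is (0, 0) with payoff 11.
CLIMBING = np.array([
    [11.0, -30.0, 0.0],
    [-30.0, 7.0, 6.0],
    [0.0, 0.0, 5.0],
])


def average_values(payoff, partner_policy):
    """Expected payoff of each row action under the partner's policy."""
    return payoff @ partner_policy


def optimistic_values(payoff):
    """Best-case payoff of each row action over all partner actions."""
    return payoff.max(axis=1)


uniform_partner = np.full(3, 1.0 / 3.0)
avg = average_values(CLIMBING, uniform_partner)
opt = optimistic_values(CLIMBING)

greedy_on_average = int(np.argmax(avg))     # action preferred by averaging
greedy_on_optimism = int(np.argmax(opt))    # action preferred by optimism
```

Under these assumed payoffs, averaging over an exploring partner favours the last row, which can never reach the joint optimum, whereas the optimistic estimate favours the first row, which contains it. This is the failure mode that optimistic updates are meant to counteract.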

Theorems & Definitions (2)

Proposition 4.1
Proof of Proposition 4.1
