Optimistic Multi-Agent Policy Gradient
Wenshuai Zhao, Yi Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen
TL;DR
This work tackles relative overgeneralization (RO) in cooperative multi-agent policy gradient methods by introducing OptiMAPPO, a simple yet effective framework that injects optimism through advantage clipping. By replacing negative advantages with zero (and optionally applying a Leaky ReLU extension), the method supports optimistic updates while preserving a formal fixed-point optimality, as shown by an operator-theoretic analysis. Empirically, OptiMAPPO improves performance on challenging benchmarks (MA-MuJoCo, Overcooked) and outperforms strong MAPG baselines (MAPPO, HAPPO, HATRPO) and optimistic Q-learning baselines in many tasks. The results suggest that targeted optimism in MAPG can mitigate RO and enhance coordination in large-scale, continuous-action, fully cooperative MARL settings, with practical impact for robotics and collaborative AI systems.
Abstract
*Relative overgeneralization* (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. No methods have been proposed for addressing RO in multi-agent policy gradient (MAPG) methods although these methods produce state-of-the-art results. To address this gap, we propose a general, yet simple, framework to enable optimistic updates in MAPG methods that alleviate the RO problem. Our approach involves clipping the advantage to eliminate negative values, thereby facilitating optimistic updates in MAPG. The optimism prevents individual agents from quickly converging to a local optimum. Additionally, we provide a formal analysis to show that the proposed method retains optimality at a fixed point. In extensive evaluations on a diverse set of tasks including the *Multi-agent MuJoCo* and *Overcooked* benchmarks, our method outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest.
