The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games
Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, Yi Wu
TL;DR
The paper shows that PPO, adapted to multi-agent settings as MAPPO (centralized value function) and IPPO (local value function), rivals state-of-the-art off-policy methods in cooperative MARL on four major benchmarks (MPE, SMAC, GRF, Hanabi) with minimal hyperparameter tuning and no domain-specific algorithmic modifications or architectures. It systematically analyzes five practical factors (value normalization, value-function inputs, training data usage, PPO clipping, and batch size) and derives concrete guidelines that improve stability and sample efficiency. The work emphasizes centralized value inputs (MAPPO) that combine global state with agent-specific information, and it introduces death masking for settings in which agents can die before an episode ends; both contribute to PPO's strong empirical performance. Together with the released codebase and comprehensive ablations, these results position PPO-based methods as robust, accessible on-policy baselines for a wide range of cooperative MARL tasks.
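Two of the factors named above, PPO clipping and value normalization, can be illustrated with a minimal sketch. This is not the released implementation: the names `clipped_surrogate` and `RunningValueNormalizer` are hypothetical, the clip range of 0.2 is only an illustrative default, and the normalizer is a plain running mean and standard deviation over value targets, assumed here as a simplified stand-in for whatever the codebase actually uses.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for a single sample (a quantity to maximize).

    ratio     -- pi_new(a|o) / pi_old(a|o)
    advantage -- advantage estimate for the sampled action
    eps       -- clip range (0.2 is an illustrative value, not from the paper)
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


class RunningValueNormalizer:
    """Running mean/std of value targets (hypothetical helper, Welford update)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, targets):
        for x in targets:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # not enough data yet; leave targets unscaled
        return math.sqrt(self.m2 / self.count + self.eps)

    def normalize(self, x):
        """Map a raw value target into the normalized space the critic regresses on."""
        return (x - self.mean) / self.std

    def denormalize(self, y):
        """Map a critic output back to the raw return scale (e.g. for advantage estimation)."""
        return y * self.std + self.mean
```

In this sketch the critic would be trained on `normalize(target)`, and its outputs passed through `denormalize` before advantages are computed, so that large or drifting return scales do not destabilize value learning.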
Abstract
Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, Google Research Football, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods can be a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.
