A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem
Paul Barde, Jakob Foerster, Derek Nowrouzezahrai, Amy Zhang
TL;DR
The paper tackles offline coordination in multi-agent RL, identifying strategy agreement (SA) and strategy fine-tuning (SFT) as the two core challenges when no new interactions are possible. It introduces MOMA-PPO, the first model-based offline MARL method, which learns a centralized world-model ensemble to synthesize inter-agent interaction data and trains policies with MAPPO, using uncertainty penalties and adaptive rollouts to prevent model exploitation. Across offline coordination benchmarks, including the Iterated Coordination Game and offline MAMuJoCo tasks with partial observability, MOMA-PPO consistently outperforms model-free baselines, demonstrating robust coordination and policy fine-tuning. The work highlights the practical value of world models for coordination in offline multi-agent systems and points to future work on refining world models and extending to broader domains.
Abstract
Training multiple agents to coordinate is an essential problem with applications in robotics, game theory, economics, and social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to what we call the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and the strategy fine-tuning (SFT) coordination challenges, two issues at which current offline MARL algorithms fail. Concretely, we reveal that the prevalent model-free methods are severely deficient and cannot handle coordination-intensive offline multi-agent tasks in either toy or MuJoCo domains. To address this setback, we emphasize the importance of inter-agent interactions and propose the very first model-based offline MARL method. Our resulting algorithm, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO), generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. This simple model-based solution solves the coordination-intensive offline tasks, significantly outperforming the prevalent model-free methods even under severe partial observability and with learned world models.
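
To make the idea of generating synthetic interaction data from a learned world-model ensemble more concrete, the following is a minimal sketch of one common way to guard against model exploitation: penalize the reward by ensemble disagreement and stop a rollout once disagreement grows too large. It is not the authors' implementation. The model signature, the standard-deviation uncertainty proxy, and the names `ensemble_step`, `adaptive_rollout`, `penalty_coef`, and `max_uncertainty` are all assumptions made for illustration.

```python
import statistics
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
# Hypothetical world-model signature (an assumption for this sketch):
# (joint_state, joint_action) -> (next_joint_state, reward)
Model = Callable[[State, Action], Tuple[State, float]]
Transition = Tuple[State, Action, float, State]


def ensemble_step(
    models: Sequence[Model],
    state: State,
    action: Action,
    penalty_coef: float = 1.0,
) -> Tuple[State, float, float]:
    """Return (mean next state, uncertainty-penalized reward, uncertainty)."""
    preds = [model(state, action) for model in models]
    next_states = [p[0] for p in preds]
    rewards = [p[1] for p in preds]
    dim = len(next_states[0])
    mean_next = tuple(
        statistics.fmean([s[i] for s in next_states]) for i in range(dim)
    )
    # Uncertainty proxy (assumption): largest per-dimension population
    # standard deviation of the predicted next state across ensemble members.
    uncertainty = max(
        statistics.pstdev([s[i] for s in next_states]) for i in range(dim)
    )
    penalized = statistics.fmean(rewards) - penalty_coef * uncertainty
    return mean_next, penalized, uncertainty


def adaptive_rollout(
    models: Sequence[Model],
    policy: Callable[[State], Action],
    start: State,
    max_steps: int = 10,
    max_uncertainty: float = 0.5,
    penalty_coef: float = 1.0,
) -> List[Transition]:
    """Generate synthetic transitions, stopping early when the ensemble disagrees."""
    trajectory: List[Transition] = []
    state = start
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, uncertainty = ensemble_step(
            models, state, action, penalty_coef
        )
        if uncertainty > max_uncertainty:
            break  # drop this transition: the models no longer agree here
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory
```

In this sketch the rollout length adapts to model quality: where the ensemble members agree, rollouts run to `max_steps`, and where they diverge, the rollout is cut short so that policies are not trained on transitions the models cannot predict reliably. The collected transitions would then be passed to a policy optimizer such as MAPPO.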
