Coordination Failure in Cooperative Offline MARL
Callum Rhys Tilbury, Claude Formanek, Louise Beyers, Jonathan P. Shock, Arnu Pretorius
TL;DR
This work analyses coordination failure in offline multi-agent reinforcement learning (MARL) under the 'Best Response Under Data' (BRUD) approach, showing that learning from static data can drive agents toward suboptimal coordination even when action products are reward-maximising. Using two-player polynomial games, the paper characterises how the BRUD update gradient can become misaligned with the true reward gradient, and how this miscoordination grows as agent interaction increases. To address this, it introduces Proximal Joint-Action Prioritisation (PJAP), a dataset-sampling strategy that prioritises experiences generated by policies similar to the current joint policy, with concrete instantiations in polynomial games and the MAMuJoCo suite. PJAP is shown to improve convergence and performance by reducing the distance between the sampled data and the learner’s current policy, and is proposed as a complement to existing offline MARL remedies such as critic and policy regularisation. The authors also provide an interactive notebook for reproducing their results, and present PJAP as a basis for broader investigation into prioritised dataset sampling in offline MARL.
Abstract
Offline multi-agent reinforcement learning (MARL) leverages static datasets of experience to learn optimal multi-agent control. However, learning from static data presents several unique challenges. In this paper, we focus on coordination failure and investigate the role of joint actions in multi-agent policy gradients with offline data, concentrating on a common setting we refer to as the 'Best Response Under Data' (BRUD) approach. Using two-player polynomial games as an analytical tool, we demonstrate a simple yet overlooked failure mode of BRUD-based algorithms, which can lead to catastrophic coordination failure in the offline setting. Building on these insights, we propose an approach to mitigate such failure by prioritising samples from the dataset based on joint-action similarity during policy learning, and we demonstrate its effectiveness in detailed experiments. More generally, we argue that prioritised dataset sampling is a promising area for innovation in offline MARL, and that it can be combined with other effective approaches such as critic and policy regularisation. Importantly, our work shows how simplified, tractable games can yield useful, theoretically grounded insights that transfer to more complex contexts. A core dimension of our offering is an interactive notebook, from which almost all of our results can be reproduced in a browser.
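To make the idea of prioritising samples by joint-action similarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical two-player polynomial game with reward R(a1, a2) = a1 * a2, treats each agent's policy as a single scalar action, and reads BRUD as each agent following the reward gradient with the other agent's action taken from the dataset. The function names, the softmax-over-distance weighting, and the `temperature` parameter are all illustrative assumptions.

```python
import numpy as np


def proximity_probs(dataset_actions, current_joint_action, temperature=0.5):
    """Sampling probabilities that favour dataset joint actions close to the
    current joint action (assumed form: softmax over negative L2 distance)."""
    diffs = np.asarray(dataset_actions, dtype=float) - np.asarray(
        current_joint_action, dtype=float
    )
    dists = np.linalg.norm(diffs, axis=1)
    logits = -dists / temperature
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    return weights / weights.sum()


def brud_step(joint_action, dataset_actions, probs, lr=0.1, batch_size=256, rng=None):
    """One gradient step for the assumed game R(a1, a2) = a1 * a2.

    dR/da1 = a2 and dR/da2 = a1, but each agent plugs in the OTHER agent's
    action from the sampled dataset batch rather than from the current policy.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    data = np.asarray(dataset_actions, dtype=float)
    idx = rng.choice(len(data), size=batch_size, p=probs)
    batch = data[idx]
    grad = np.array([batch[:, 1].mean(), batch[:, 0].mean()])
    return np.asarray(joint_action, dtype=float) + lr * grad
```

In this toy setting, if the current joint action is (0.5, 0.5) and the data sits at (-1, -1), the data-based update pushes both agents downward, whereas the true reward gradient at (0.5, 0.5) points upward. Passing `proximity_probs(...)` as `probs` instead of uniform weights shifts sampling toward joint actions near the current one, which is the intuition behind prioritising by joint-action similarity.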
