Coordination Failure in Cooperative Offline MARL

Callum Rhys Tilbury; Claude Formanek; Louise Beyers; Jonathan P. Shock; Arnu Pretorius

TL;DR

This work analyses coordination failure in offline multi-agent reinforcement learning (MARL) under the 'Best Response Under Data' (BRUD) approach, in which each agent updates its policy as a best response to the other agents' actions stored in the dataset. It shows that learning from static data in this way can drive agents toward poorly coordinated joint policies, even in simple games such as one where the reward is the product of the agents' actions. Using two-player polynomial games, the paper characterises how the BRUD update direction can become misaligned with the true reward gradient, and how the risk of miscoordination grows with increased agent interaction. To address this, it introduces Proximal Joint-Action Prioritisation (PJAP), a dataset-sampling strategy that prioritises experiences whose joint actions are similar to those of the current joint policy, with concrete instantiations in polynomial games and the MAMuJoCo suite. PJAP is shown to improve convergence and performance by reducing the distance between sampled data and the learner's current policy, and is proposed as a complement to existing offline MARL remedies such as critic and policy regularisation. The authors also provide an interactive notebook for reproducing almost all of their results, and present PJAP as a basis for broader investigation into prioritised dataset sampling in offline MARL.

Abstract

Offline multi-agent reinforcement learning (MARL) leverages static datasets of experience to learn optimal multi-agent control. However, learning from static data presents several unique challenges to overcome. In this paper, we focus on coordination failure and investigate the role of joint actions in multi-agent policy gradients with offline data, concentrating on a common setting we refer to as the 'Best Response Under Data' (BRUD) approach. By using two-player polynomial games as an analytical tool, we demonstrate a simple yet overlooked failure mode of BRUD-based algorithms, which can lead to catastrophic coordination failure in the offline setting. Building on these insights, we propose an approach to mitigate such failure, by prioritising samples from the dataset based on joint-action similarity during policy learning, and demonstrate its effectiveness in detailed experiments. More generally, however, we argue that prioritised dataset sampling is a promising area for innovation in offline MARL that can be combined with other effective approaches such as critic and policy regularisation. Importantly, our work shows how insights drawn from simplified, tractable games can lead to useful, theoretically grounded insights that transfer to more complex contexts. A core dimension of our offering is an interactive notebook, from which almost all of our results can be reproduced, in a browser.
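
The BRUD failure mode can be illustrated with a minimal sketch, using the sign-agreement game $R = a_x a_y$ that the paper lists among its polynomial games. The sketch rests on simplifying assumptions that are not taken from the paper: each deterministic policy outputs its parameter directly ($\mu(\theta_i) = \theta_i$), the critic equals the true reward, and the policy parameters, datapoint, and learning rate are hypothetical values chosen for illustration.

```python
# Minimal sketch of a BRUD-style update in the sign-agreement game R = a_x * a_y.
# Assumptions (illustrative, not from the paper): identity policies mu(theta) = theta,
# a perfect critic Q = R, and hand-picked numbers for theta, the datapoint, and lr.

def reward(a_x, a_y):
    return a_x * a_y

def true_gradient(theta_x, theta_y):
    # Gradient of R(theta_x, theta_y) with respect to (theta_x, theta_y).
    return (theta_y, theta_x)

def brud_gradient(data_a_x, data_a_y):
    # Each agent differentiates R with respect to its own action while the
    # other agent's action is held at the value stored in the dataset.
    return (data_a_y, data_a_x)

def step(theta, grad, lr=0.1):
    # One gradient-ascent step on the joint policy parameters.
    return (theta[0] + lr * grad[0], theta[1] + lr * grad[1])

theta = (0.2, 0.6)       # hypothetical current joint policy
datapoint = (0.5, -0.8)  # hypothetical stale joint action: a_x > 0, a_y < 0

g_true = true_gradient(*theta)
g_brud = brud_gradient(*datapoint)

# A negative inner product means the BRUD step opposes the true ascent direction.
alignment = g_true[0] * g_brud[0] + g_true[1] * g_brud[1]

reward_before = reward(*theta)
reward_after_brud = reward(*step(theta, g_brud))
reward_after_true = reward(*step(theta, g_true))

print(alignment, reward_before, reward_after_brud, reward_after_true)
```

With these particular numbers the BRUD step lowers the reward while a step along the true gradient raises it. The sign of the misalignment depends on how far the sampled joint action lies from the current joint policy, so other choices of `theta` and `datapoint` can give an aligned update.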

Paper Structure (19 sections, 9 equations, 6 figures)

Introduction
Foundations
Multi-Agent Reinforcement Learning
Joint Action Formulation
Polynomial Games
Coordination Failure in Offline MARL
Connections to Off-Policy Learning
Growing Risk of Miscoordination with Increased Agent Interaction
Decoupled Rewards: $R = a_x + a_y$.
Sign Agreement: $R=a_x a_y$.
Action Agreement: $R=-(a_x-a_y)^2$.
Twin Peaks: $R=-A(a_x^2+a_y^2) - B(a_x a_y)^2 + Ca_x a_y,\; \{A>0, B>0, C>2A\}$.
Remark
Proximal Joint-Action Prioritisation for Offline Learning
PJAP in Polynomial Games
...and 4 more sections

Figures (6)

Figure 1: Illustration of catastrophic miscoordination when agents each learn based on a best response to the data of other agent actions (BRUD). We consider using a datapoint $\mathbf{a}_{(t)}$, in a simple game where the reward is given by the product of each agent's action, $R(a_x, a_y)=a_x a_y$. The best response of agent $x$, in response to the other agent's negative data point, $a_{y(t)} < 0$, is to make its own policy $\mu(\theta_x)$ more negative. Similarly, agent $y$ updates $\mu(\theta_y)$ to be more positive, in response to the other agent's positive data point, $a_{x(t)}>0$. Alas, the BRUD step moves the joint policy in the opposite direction of optimal increase.
Figure 2: Demonstrating the impact of replay buffer size, as a proxy for off-policyness, on the policy learning with online MADDPG. We show the learning trajectory in policy space (top), the learning as the reward over time (middle), and the state of the replay buffer in the final training update (bottom). We see that increasing the buffer size leads to less optimal trajectories being learnt, due to the presence of the stale data in the replay buffer. With the BRUD update, we can see that it is important for the sampled joint action to remain fairly close to the current joint policy, to avoid miscoordination.
Figure 3: The results of using a uniform dataset $\mathcal{B}$ for offline MADDPG policy learning, in the sign-agreement game, $R(a_x, a_y) = a_xa_y$. We see that the net direction of policy learning is predetermined by the mean of the dataset, due to the BRUD approach, regardless of the policy initialisation.
Figure 4: Visualisations from the Twin Peaks game (with $A=1, B=4, C=5$). We see that with an origin-centred dataset, offline BRUD learning cannot find the true policy optimum, regardless of the dataset variance, always simply converging to the origin. With an optimum-centred dataset, optimality in the learnt policy is only found if the variance is zero. As the variance increases, the learnt policy moves away from the true optimum and towards the origin. These empirical results validate the analytical solutions.
Figure 5: Results of using PJAP with MADDPG in the Twin Peaks game, fixing the problem previously seen in Figure 4 with the optimum-centred dataset. Each row uses a specific dataset: all centred on the true optimum, but with increasing variances, shown in the first column. The corresponding trajectories of using MADDPG with and without PJAP are shown in the second column. The third column shows how using PJAP lowers the mean distance between sampled data and the current policy, which enables convergence to higher performance, seen in the fourth column.
...and 1 more figures
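
The idea behind the Figure 5 caption, that prioritised sampling lowers the mean distance between sampled data and the current policy, can be sketched as follows. This is a hedged illustration only: the summary does not state PJAP's priority function, so the exponential weighting, the Euclidean distance, the `beta` temperature, and the synthetic dataset are all assumptions introduced here.

```python
import math
import random

# Sketch of distance-based prioritised sampling in the spirit of PJAP.
# Assumptions (not from the paper): priorities are exp(-beta * distance), the
# distance is Euclidean in joint-action space, and the dataset is synthetic.

def joint_distance(joint_action, joint_policy_action):
    # Euclidean distance between a stored joint action and the current policy's action.
    return math.dist(joint_action, joint_policy_action)

def pjap_weights(dataset, joint_policy_action, beta=5.0):
    # Higher weight for samples whose joint action is closer to the current joint policy.
    raw = [math.exp(-beta * joint_distance(a, joint_policy_action)) for a in dataset]
    total = sum(raw)
    return [w / total for w in raw]

def sample_batch(dataset, weights, batch_size, rng):
    # Draw a minibatch with replacement according to the priorities.
    return rng.choices(dataset, weights=weights, k=batch_size)

rng = random.Random(0)
# Hypothetical dataset of joint actions (a_x, a_y) scattered around (0.6, 0.6).
dataset = [(rng.gauss(0.6, 0.5), rng.gauss(0.6, 0.5)) for _ in range(1000)]
current_policy = (0.3, 0.4)  # hypothetical output of the current joint policy

weights = pjap_weights(dataset, current_policy)
distances = [joint_distance(a, current_policy) for a in dataset]

# Expected distance of a sampled joint action under each sampling scheme.
uniform_mean_distance = sum(distances) / len(distances)
prioritised_mean_distance = sum(w * d for w, d in zip(weights, distances))

batch = sample_batch(dataset, weights, batch_size=32, rng=rng)
print(uniform_mean_distance, prioritised_mean_distance)
```

In a training loop the weights would need to be recomputed as the joint policy changes, since the priorities depend on the current policy's actions. How often to refresh them is a design choice this sketch leaves open.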
