Multi-agent cooperation through learning-aware policy gradients

Alexander Meulemans, Seijin Kobayashi, Johannes von Oswald, Nino Scherrer, Eric Elmoznino, Blake Richards, Guillaume Lajoie, Blaise Agüera y Arcas, João Sacramento

TL;DR

The paper tackles the difficulty of achieving cooperation among self-interested, independently learning agents in general-sum games by introducing COALA-PG, an unbiased, higher-derivative-free policy gradient for learning-aware reinforcement learning. It formalizes a batched co-player shaping POMDP and leverages long-context sequence models to capture how co-players learn over multiple inner episodes, enabling effective policy updates that shape others’ learning. Through analytical IPD results and extensive experiments on IPD and sequential social dilemmas, COALA-PG demonstrates that learning-awareness can induce extortion against naive learners and, crucially, cooperation among learning-aware agents, especially in heterogeneous groups. The work also clarifies connections to LOLA, showing that COALA-PG can achieve similar cooperative dynamics without higher-order derivatives, with significant implications for scalable, decentralized multi-agent learning and cooperative behavior emergence in complex environments.

Abstract

Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.

Paper Structure

This paper contains 48 sections, 4 theorems, 35 equations, 14 figures, 8 tables, 1 algorithm.

Key Result

Theorem 3.1

Take the expected shaping return $\bar{J}(\phi^i) = \mathbb{E}_{\bar{P}^{\phi^i}}\left[ \frac{1}{B}\sum_{b=1}^B \sum_{l=0}^{MT} R^{i,b}_l \right]$, with $\bar{P}^{\phi^i}$ the distribution induced by the environment dynamics $\bar{P}_t$, initial state distribution $\bar{P}_i$ and policy $\phi^i$. Th
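As a rough illustration of the quantity inside the expectation above, the following sketch computes a single-sample estimate of the shaping return from one meta-episode: rewards are summed along each meta-trajectory and then averaged over the batch of $B$ meta-trajectories. The function name `empirical_shaping_return`, the nested-list layout `rewards[b][l]`, and the toy values are assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence


def empirical_shaping_return(rewards: Sequence[Sequence[float]]) -> float:
    """Single-sample estimate of the shaping return for one meta-episode.

    rewards[b][l] is assumed to hold agent i's reward at step l of
    meta-trajectory b (hypothetical layout chosen for this sketch).
    """
    if not rewards:
        raise ValueError("need at least one meta-trajectory")
    batch_size = len(rewards)
    # Sum rewards over steps within each meta-trajectory,
    # then average over the batch dimension (the 1/B factor).
    return sum(sum(trajectory) for trajectory in rewards) / batch_size


# Toy values (made up): B = 2 meta-trajectories with 4 steps each.
toy_rewards = [[1.0, 0.0, 3.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
estimate = empirical_shaping_return(toy_rewards)
print(estimate)
```

Averaging many such estimates over independently sampled meta-episodes would approximate the expectation; the sketch does not cover how the gradient of this quantity is estimated.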

Figures (14)

  • Figure 1: A. Experience data terminology. Inner-episodes comprise $T$ steps of (inner) game play, played between agents $B$ times in parallel, forming a batch of inner-episodes. A given sequence of $M$ inner-episodes forms a meta-trajectory, thus comprising $MT$ steps of inner game play. The collection of $B$ meta-trajectories forms a meta-episode. B. During game play, a naive agent takes only the current episode context into account for decision making. In contrast, a meta agent takes the full long context into account. Seeing multiple episodes of game play endows a meta agent with learning awareness.
  • Figure 2: Policy update and credit assignment of naive and meta agents. For credit assignment of action $a_l^{i,b}$, a naive agent (left) takes only intra-episode context into account. A COALA agent (right) takes inter-episode context across the batch dimension into account. For policy updates, a naive agent aggregates policy gradients over the inner-batch dimension (dashed blocks) and updates their policy between episode boundaries. In contrast, a COALA agent updates their policy at a lower frequency along the meta-episode dimension.
  • Figure 3: (A) Learning-aware agents learn to extort naive learners, even when initialized with a pure defection strategy. (B) An extortion policy developed against naive agents (shaded period) turns into a cooperative one when playing against another learning-aware agent (M1 & M2). (C) Cooperation emerges within mixed training pools of naive and learning-aware agents, but not in pools of learning-aware agents only. The shaded regions represent the interquartile range (25th to 75th quantiles) across 32 random seeds.
  • Figure 4: (A) Performance of two agents trained by LOLA-DICE on the iterated prisoner's dilemma with analytical gradients for various look-ahead steps (only the performance of the first agent is shown). (B) Performance of a randomly initialized naive learner trained against the fixed LOLA 20 look-aheads policy taken from the end of training of (A). (C) Same setting as (A), but with the naive gradient $\lambda \frac{\partial}{\partial \phi^{-i}}J^i(\phi, \phi^{-i})$ added to the LOLA-DICE update, with $\lambda$ a hyperparameter (cf. Appendix \ref{app:ipd_analytic}). Shaded regions indicate standard error computed over 64 seeds.
  • Figure 5: Agents trained by COALA-PG play iterated prisoner's dilemma. (A): When trained against naive agents only, COALA-PG-trained agents extort the latter and reach considerably higher reward than other baseline agents. The stars ($\star$) indicate overlapping curves of the corresponding color at that point. (B): When analyzing the behavior of the agents within one meta-episode, we observe COALA-PG-trained agents shaping naive co-players, leading to a low defection rate in the beginning, which is then exploited towards the end. M-FOS, on the other hand, defects from the beginning, achieving lower reward and thus failing to properly optimize the shaping problem. Batch-unaware COALA-PG performs identically to M-FOS and is therefore omitted. (C): Average performance of meta agents playing against other meta agents, when training a group of meta agents against a mixture of naive and other meta agents. Such agents trained with COALA-PG cooperate when playing against each other, but fail to do so when trained with baseline methods. When removing naive agents from the pool, meta agents also fail to cooperate, as predicted in Section \ref{fig:analytical-IPD}. Shaded regions indicate standard deviation computed over 5 seeds.
  • ...and 9 more figures
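
The experience-data terminology in the Figure 1 caption can be sketched in a few lines: a meta-trajectory of $MT$ steps splits into $M$ inner-episodes of $T$ steps, a naive agent conditions only on the current inner-episode, and a meta agent conditions on the full history. The helper names below (`split_meta_trajectory`, `naive_context`, `meta_context`) and the choice to include step `l` itself in the context are assumptions made for illustration, not definitions from the paper.

```python
from typing import List, Sequence, TypeVar

S = TypeVar("S")


def split_meta_trajectory(steps: Sequence[S], T: int) -> List[List[S]]:
    """Split a meta-trajectory of M*T steps into M inner-episodes of T steps."""
    if T <= 0 or len(steps) % T != 0:
        raise ValueError("length must be a positive multiple of T")
    return [list(steps[k:k + T]) for k in range(0, len(steps), T)]


def naive_context(steps: Sequence[S], T: int, l: int) -> List[S]:
    """Steps visible to a naive agent at step l: current inner-episode only."""
    start = (l // T) * T  # index of the first step of the current inner-episode
    return list(steps[start:l + 1])


def meta_context(steps: Sequence[S], l: int) -> List[S]:
    """Steps visible to a meta agent at step l: the whole history so far."""
    return list(steps[:l + 1])


# Toy meta-trajectory (made up): M = 3 inner-episodes of T = 2 steps,
# with step indices standing in for observations.
toy_steps = list(range(6))
inner_episodes = split_meta_trajectory(toy_steps, T=2)
naive_view = naive_context(toy_steps, T=2, l=3)
meta_view = meta_context(toy_steps, l=3)
print(inner_episodes, naive_view, meta_view)
```

A meta-episode would then be a collection of $B$ such meta-trajectories played in parallel; the batch dimension is left out here to keep the sketch short.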

Theorems & Definitions (6)

  • Theorem 3.1
  • Theorem D.1
  • proof
  • Theorem D.2
  • proof
  • Theorem E.1