Multi-agent cooperation through in-context co-player inference

Marissa A. Weis, Maciej Wołczyk, Rajai Nasser, Rif A. Saurous, Blaise Agüera y Arcas, João Sacramento, Alexander Meulemans

TL;DR

This work tackles cooperation among self-interested agents in decentralized multi-agent reinforcement learning (MARL). It shows that sequence-model agents trained against a diverse co-player distribution develop in-context best-response policies within iterated prisoner's dilemma (IPD) episodes. The authors introduce Predictive Policy Improvement (PPI), which uses a sequence model as both world model and policy prior, and demonstrate that training against a mixed pool of learning agents and tabular opponents yields robust cooperation without explicit timescale separation or meta-gradients. They identify the mechanism behind this result: in-context adaptation makes agents vulnerable to extortion, and mutual extortion in a mixed population drives agents toward cooperative behavior, consistent with Nash-like and embedded-equilibrium concepts. The findings suggest a scalable path to cooperative behavior in multi-agent systems using standard decentralized MARL with diverse co-players, connecting with foundation-model paradigms of in-context learning.
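
To make the setting concrete, the following is a minimal sketch of an IPD episode between two tabular (memory-1) policies. It is an illustration only, not the paper's environment: the payoff values are the conventional T=5, R=3, P=1, S=0 (an assumption, since the summary does not state them), and the helper names `make_tabular` and `play_episode` are hypothetical.

```python
import random

# Assumed conventional IPD payoffs (T=5, R=3, P=1, S=0); not taken from the paper.
# Keys are (action of player A, action of player B); values are (reward A, reward B).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def make_tabular(p_first, p_cc, p_cd, p_dc, p_dd):
    """Hypothetical helper: a memory-1 tabular policy.

    Maps (own last action, co-player last action) to the probability of
    cooperating; the key None is used for the first round.
    """
    return {
        None: p_first,
        ("C", "C"): p_cc,
        ("C", "D"): p_cd,
        ("D", "C"): p_dc,
        ("D", "D"): p_dd,
    }


def play_episode(policy_a, policy_b, steps, rng):
    """Roll out one episode and return (return A, return B, action history)."""
    state_a = state_b = None  # each player's view of the previous round
    total_a = total_b = 0
    history = []
    for _ in range(steps):
        act_a = "C" if rng.random() < policy_a[state_a] else "D"
        act_b = "C" if rng.random() < policy_b[state_b] else "D"
        rew_a, rew_b = PAYOFF[(act_a, act_b)]
        total_a += rew_a
        total_b += rew_b
        history.append((act_a, act_b))
        # Each player sees the round from its own side: (own, co-player).
        state_a, state_b = (act_a, act_b), (act_b, act_a)
    return total_a, total_b, history


# Two familiar tabular policies expressed in this format.
tit_for_tat = make_tabular(1.0, 1.0, 0.0, 1.0, 0.0)
always_defect = make_tabular(0.0, 0.0, 0.0, 0.0, 0.0)

if __name__ == "__main__":
    print(play_episode(always_defect, tit_for_tat, 10, random.Random(0))[:2])
```

Sampling the five probabilities at random would give one plausible version of a "random tabular opponent"; how the paper actually samples its pool is not specified in this summary.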

Abstract

Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules or enforce a strict separation between "naive learners" updating on fast timescales and "meta-learners" observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, where vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models combined with co-player diversity provides a scalable path to learning cooperative behaviors.
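
The idea of inferring the co-player from the interaction history can be illustrated with an explicit Bayesian update over a small pool of candidate tabular policies. This is a swapped-in, hand-written stand-in: the paper's agents are sequence models that are reported to perform this kind of inference implicitly in-context, not through the code below. The pool contents and the function name `posterior_over_pool` are assumptions made for illustration.

```python
# Each candidate is a memory-1 table: (own last action, co-player last action)
# -> probability of cooperating, with None as the first-round key.
# The pool below is a hypothetical example, not the paper's opponent set.
POOL = {
    "tit_for_tat": {None: 1.0, ("C", "C"): 1.0, ("C", "D"): 0.0,
                    ("D", "C"): 1.0, ("D", "D"): 0.0},
    "always_cooperate": {None: 1.0, ("C", "C"): 1.0, ("C", "D"): 1.0,
                         ("D", "C"): 1.0, ("D", "D"): 1.0},
    "always_defect": {None: 0.0, ("C", "C"): 0.0, ("C", "D"): 0.0,
                      ("D", "C"): 0.0, ("D", "D"): 0.0},
}


def posterior_over_pool(pool, history):
    """Posterior over candidate co-players given the episode so far.

    `history` is a list of (agent action, co-player action) pairs.
    A uniform prior over the pool is assumed.
    """
    weights = {name: 1.0 for name in pool}
    prev = None
    for agent_act, coplayer_act in history:
        # The co-player conditions on (its own last action, the agent's last action).
        key = None if prev is None else (prev[1], prev[0])
        for name, table in pool.items():
            p_coop = table[key]
            weights[name] *= p_coop if coplayer_act == "C" else 1.0 - p_coop
        prev = (agent_act, coplayer_act)
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("history is impossible under every candidate in the pool")
    return {name: w / total for name, w in weights.items()}


if __name__ == "__main__":
    # The co-player opens with C, stays at C after our C, and answers our D with D.
    print(posterior_over_pool(POOL, [("C", "C"), ("D", "C"), ("C", "D")]))
```

Once the posterior concentrates on one candidate, a best response against that candidate could be computed separately; that planning step is omitted here.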

Paper Structure

This paper contains 41 sections, 6 theorems, 45 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Lemma C.1

$\bar{J}(p_{\phi_k}, \phi_k) = J(p_{\phi_k}, \phi_k)$

Figures (4)

  • Figure 1: Mixed training leads to robust cooperation. RL agents trained against a mix of tabular policies and learning agents converge to cooperation (solid lines). Ablations: Agents trained purely against other learning agents (dotted lines) or with access to explicit co-player identifications (dashed lines) converge to defection, highlighting that in-context inference is a critical factor for the learning of cooperative behaviors with standard decentralized MARL. Error bars indicate standard deviation across 10 random seeds.
  • Figure 2: A-B: Emergence of in-context best response. Performance of PPI agents (trained against random tabular opponents) when evaluated against specific fixed strategies. The agents demonstrate in-context learning, identifying the opponent and converging to the best response within the episode. C-D: Learning to extort in-context learners. Agents trained against a "Fixed In-Context Learner" (an agent pre-trained in Step 1 to best-respond to tabular policies) learn to extort it. The RL agent achieves a higher share of the reward by exploiting the in-context adaptation of its opponent. E-F: From mutual extortion to cooperation. When two agents initialized with extortion policies (from Step 2) play against each other, their mutual attempts to extort their co-player result in the shaping of each other's policy towards more cooperative behavior, both within episodes through in-context learning (F) and across episodes through in-weight learning (E). Error bars indicate standard deviation across 10 random seeds.
  • Figure 3: Emergence of best-response in mixed training. We plot within-episode performance of the models trained in Figure 1 before convergence. We observe that both A2C and PPI try to extort their counterpart at the beginning of the episode, which subsequently leads to increased levels of cooperation. At the same time, identifying the opponent as a non-tit-for-tat-like tabular policy leads to a high defection ratio. Error bars indicate standard deviation across 10 random seeds.
  • Figure 4: A-B: Emergence of in-context best response. Performance of A2C trained against random tabular opponents and evaluated after convergence on a set of specific static policies. We denote the final agent as "Fixed In-Context Learner". C-D: Learning to extort in-context learners. Performance of a randomly initialized A2C agent against the Fixed In-Context Learner. E-F: From mutual extortion to cooperation. Two A2C extortion agents initially converge to cooperation when playing against each other, but over time they may collapse to mutual defection depending on the random seed. Error bars correspond to standard deviation over 5 random initializations.
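
The notion of extortion referred to in the captions above can be made concrete with a classical hand-crafted example: a zero-determinant extortion strategy in the sense of Press and Dyson. This is not the learned behavior of the paper's agents, only a sketch of what "achieving a higher share of the reward" means in the IPD. The payoffs (T=5, R=3, P=1, S=0), the extortion factor `chi`, the scale `phi`, and the opponent `q` are all assumptions chosen for illustration.

```python
# Assumed conventional IPD payoffs; not taken from the paper.
R, S, T, P = 3.0, 0.0, 5.0, 1.0

# Outcome order used throughout, written as (X action, Y action): CC, CD, DC, DD.


def extortion_policy(chi, phi):
    """Memory-1 cooperation probabilities of a zero-determinant extortioner X.

    Built so that (payoff_X - P) = chi * (payoff_Y - P) holds in the long run,
    whatever memory-1 policy Y plays.
    """
    p = [
        1.0 - phi * (chi - 1.0) * (R - P),
        1.0 - phi * ((P - S) + chi * (T - P)),
        phi * ((T - P) + chi * (P - S)),
        0.0,
    ]
    if not all(0.0 <= x <= 1.0 for x in p):
        raise ValueError("phi is too large for these payoffs and chi")
    return p


def long_run_payoffs(p, q, iters=5000):
    """Average payoffs of X (policy p) and Y (policy q) under the stationary
    distribution of the outcome chain, approximated by power iteration.

    q is indexed from Y's own viewpoint, so outcomes CD and DC are swapped.
    """
    swap = [0, 2, 1, 3]
    v = [0.25, 0.25, 0.25, 0.25]
    for _ in range(iters):
        nxt = [0.0, 0.0, 0.0, 0.0]
        for s, mass in enumerate(v):
            px, qy = p[s], q[swap[s]]
            nxt[0] += mass * px * qy
            nxt[1] += mass * px * (1.0 - qy)
            nxt[2] += mass * (1.0 - px) * qy
            nxt[3] += mass * (1.0 - px) * (1.0 - qy)
        v = nxt
    payoff_x = v[0] * R + v[1] * S + v[2] * T + v[3] * P
    payoff_y = v[0] * R + v[1] * T + v[2] * S + v[3] * P
    return payoff_x, payoff_y


if __name__ == "__main__":
    extortioner = extortion_policy(chi=3.0, phi=1.0 / 26.0)
    opponent = [0.7, 0.2, 0.6, 0.3]  # arbitrary stochastic memory-1 policy
    sx, sy = long_run_payoffs(extortioner, opponent)
    print(sx - P, 3.0 * (sy - P))  # the two surpluses should agree
```

Against such a strategy, an adaptive co-player can only raise its own payoff by cooperating more, which raises the extortioner's payoff by a larger amount. This is one way to read the captions' description of an agent exploiting its opponent's in-context adaptation.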

Theorems & Definitions (17)

  • Lemma C.1
  • proof
  • Lemma C.2
  • proof
  • Definition D.1: Global Predictive Equilibrium
  • Definition D.2: Local Predictive Equilibrium
  • Theorem D.3: Existence of Local Predictive Equilibrium
  • proof
  • Definition D.4: Mixed Predictive Equilibrium
  • Theorem D.5: Existence of Mixed Predictive Equilibrium
  • ...and 7 more