Multi-agent cooperation through in-context co-player inference
Marissa A. Weis, Maciej Wołczyk, Rajai Nasser, Rif A. Saurous, Blaise Agüera y Arcas, João Sacramento, Alexander Meulemans
TL;DR
This work tackles cooperation among self-interested agents in decentralized multi-agent reinforcement learning (MARL). It shows that sequence-model agents trained against a diverse co-player distribution develop in-context best-response policies within episodes of the iterated prisoner's dilemma (IPD). The authors introduce Predictive Policy Improvement (PPI), which uses a sequence model as both world model and policy prior, and demonstrate that mixed-pool training against learning and tabular opponents yields robust cooperation without explicit timescale separation or meta-gradients. They identify the underlying mechanism: in-context adaptation makes agents vulnerable to extortion, and mutual extortion in a mixed population drives agents toward cooperative behavior, in line with Nash-like and embedded-equilibrium concepts. The findings suggest a scalable path to cooperative behavior in multi-agent systems using standard decentralized MARL with diverse co-players, connecting with the in-context learning paradigm of foundation models.
Abstract
Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules or enforce a strict separation between "naive learners" updating on fast timescales and "meta-learners" observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, where vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models combined with co-player diversity provides a scalable path to learning cooperative behaviors.
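
To make the phrase "vulnerable to extortion" concrete, the following minimal Python sketch computes exact expected payoffs in an iterated prisoner's dilemma between a fixed extortionate tabular (memory-one) strategy and two simple co-players. It is an illustration only and not the authors' agents or training setup. The payoff values (5, 3, 1, 0), the horizon of 2000 rounds, the helper name `average_payoffs`, and the specific extortion table (the zero-determinant strategy with extortion factor 3 described by Press and Dyson) are all assumptions introduced here.

```python
from itertools import product

C, D = 0, 1  # cooperate, defect

# Assumed standard prisoner's dilemma payoffs: (my action, their action) -> my reward.
PAYOFF = {(C, C): 3.0, (C, D): 0.0, (D, C): 5.0, (D, D): 1.0}

# Memory-one (tabular) strategies: (own last action, other's last action)
# -> probability of cooperating in the next round.
EXTORT = {(C, C): 11 / 13, (C, D): 1 / 2, (D, C): 7 / 26, (D, D): 0.0}
ALL_C = {s: 1.0 for s in product((C, D), repeat=2)}
ALL_D = {s: 0.0 for s in product((C, D), repeat=2)}


def average_payoffs(p_x, p_y, rounds=2000, first_coop=(1.0, 1.0)):
    """Exact expected per-round payoffs (x, y) of two memory-one strategies.

    The distribution over joint actions is propagated analytically,
    so no random sampling is involved.
    """
    # Distribution over joint actions (a_x, a_y) in the first round.
    dist = {}
    for ax, ay in product((C, D), repeat=2):
        px = first_coop[0] if ax == C else 1.0 - first_coop[0]
        py = first_coop[1] if ay == C else 1.0 - first_coop[1]
        dist[(ax, ay)] = px * py

    total_x = total_y = 0.0
    for _ in range(rounds):
        total_x += sum(pr * PAYOFF[(ax, ay)] for (ax, ay), pr in dist.items())
        total_y += sum(pr * PAYOFF[(ay, ax)] for (ax, ay), pr in dist.items())
        # Advance the joint-action distribution by one round.
        nxt = {s: 0.0 for s in dist}
        for (ax, ay), pr in dist.items():
            cx = p_x[(ax, ay)]  # X conditions on (own, other)
            cy = p_y[(ay, ax)]  # Y conditions on (own, other)
            for nx, ny in product((C, D), repeat=2):
                qx = cx if nx == C else 1.0 - cx
                qy = cy if ny == C else 1.0 - cy
                nxt[(nx, ny)] += pr * qx * qy
        dist = nxt
    return total_x / rounds, total_y / rounds


if __name__ == "__main__":
    print("vs always-cooperate:", average_payoffs(EXTORT, ALL_C))
    print("vs always-defect:  ", average_payoffs(EXTORT, ALL_D, first_coop=(1.0, 0.0)))
```

Under these assumptions, a co-player facing the extortioner earns more by always cooperating (about 21/11 ≈ 1.91 per round in the long run) than by always defecting (about 1), yet the extortioner earns more still (about 41/11 ≈ 3.73). An agent that adapts toward its best response within an episode is therefore pulled into cooperating on unfavorable terms, which is the kind of vulnerability the abstract refers to.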
