Multi-agent cooperation through learning-aware policy gradients
Alexander Meulemans, Seijin Kobayashi, Johannes von Oswald, Nino Scherrer, Eric Elmoznino, Blake Richards, Guillaume Lajoie, Blaise Agüera y Arcas, João Sacramento
TL;DR
The paper addresses the difficulty of achieving cooperation among self-interested, independently learning agents in general-sum games. It introduces COALA-PG, an unbiased, higher-derivative-free policy gradient for learning-aware reinforcement learning. The method formalizes a batched co-player shaping POMDP and uses long-context sequence models to capture how co-players learn over multiple inner episodes, so that policy updates can shape others’ learning. Analytical results on the iterated prisoner’s dilemma (IPD) and experiments on the IPD and sequential social dilemmas show that learning-awareness can induce extortion of naive learners and, crucially, cooperation among learning-aware agents, especially in heterogeneous groups. The work also clarifies the connection to LOLA, showing that COALA-PG can achieve similar cooperative dynamics without higher-order derivatives, which matters for scalable, decentralized multi-agent learning and for the emergence of cooperation in complex environments.
Abstract
Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.
