Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning
Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H. S. Torr, Pushmeet Kohli, Shimon Whiteson
TL;DR
The paper tackles the instability that arises when experience replay is combined with deep multi-agent Q-learning: because the other agents are learning concurrently, each agent faces a nonstationary environment and replayed data goes stale. It proposes two methods: (i) multi-agent importance sampling, which reweights replay data according to how the joint policy of the other agents has changed, and (ii) fingerprints, which condition each agent’s Q-function on a low-dimensional summary of the other agents’ policies and of training progress. On a decentralised StarCraft micromanagement benchmark, both approaches enable stable learning with replay and outperform baselines, particularly with feedforward architectures. The work is a step towards scalable, replay-enabled deep MARL in nonstationary, partially observable domains, and suggests extensions to actor-critic frameworks and other nonstationary tasks.
Abstract
Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods typically scale poorly in the problem size. Therefore, a key challenge is to translate the success of deep learning on single-agent RL to the multi-agent setting. A major stumbling block is that independent Q-learning, the most popular multi-agent RL method, introduces nonstationarity that makes it incompatible with the experience replay memory on which deep Q-learning relies. This paper proposes two methods that address this problem: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent's value function on a fingerprint that disambiguates the age of the data sampled from the replay memory. Results on a challenging decentralised variant of StarCraft unit micromanagement confirm that these methods enable the successful combination of experience replay with multi-agent RL.
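The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the replay memory stores, for each sample, the probability the other agents assigned to their actions at collection time, and that the fingerprint is the pair (exploration rate, normalised training iteration). The abstract specifies neither detail, and all function and parameter names here are hypothetical.

```python
import numpy as np


def importance_weights(probs_now, probs_then, floor=1e-8):
    """Weight each replayed sample by how likely the other agents'
    recorded actions are under their current policies, relative to
    the policies they followed when the sample was collected.

    Both arrays have shape (batch, n_other_agents). The floor is an
    assumption added here to avoid division by zero.
    """
    ratios = probs_now / np.maximum(probs_then, floor)
    # Joint ratio over the other agents: product of per-agent ratios.
    return ratios.prod(axis=1)


def weighted_td_loss(td_errors, weights):
    """Mean squared TD error with one importance weight per sample."""
    return float(np.mean(weights * td_errors ** 2))


def add_fingerprint(obs, epsilon, iteration, total_iterations):
    """Append an assumed two-number fingerprint to an observation so
    the Q-function can tell how old a replayed sample is."""
    obs = np.asarray(obs, dtype=float)
    fingerprint = np.array([epsilon, iteration / total_iterations])
    return np.concatenate([obs, fingerprint])
```

Under these assumptions, samples whose recorded joint action has become unlikely under the other agents' current policies get small weights and contribute little to the loss, while the fingerprint lets one network represent values for different stages of training without discarding old data.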
