Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning

Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H. S. Torr, Pushmeet Kohli, Shimon Whiteson

TL;DR

The paper tackles the instability that arises when experience replay is combined with deep multi-agent Q-learning: because the other agents are also learning, each agent’s environment is nonstationary and stored transitions become obsolete. It proposes two methods: (i) multi-agent importance sampling, which weights replayed data according to how the joint policy of the other agents has changed, and (ii) fingerprints, which condition each agent’s Q-function on a low-dimensional summary of the others’ policies and training progress. On a decentralised StarCraft micromanagement benchmark, these approaches enable stable learning with replay and outperform baselines, particularly with feedforward architectures. The work advances scalable, replay-enabled deep MARL in nonstationary, partially observable domains and suggests extensions to actor-critic frameworks and other nonstationary tasks.

Abstract

Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods typically scale poorly in the problem size. Therefore, a key challenge is to translate the success of deep learning on single-agent RL to the multi-agent setting. A major stumbling block is that independent Q-learning, the most popular multi-agent RL method, introduces nonstationarity that makes it incompatible with the experience replay memory on which deep Q-learning relies. This paper proposes two methods that address this problem: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent's value function on a fingerprint that disambiguates the age of the data sampled from the replay memory. Results on a challenging decentralised variant of StarCraft unit micromanagement confirm that these methods enable the successful combination of experience replay with multi-agent RL.
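
The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the importance weight is the ratio between the other agents' current joint action probability and the probability recorded when the transition was stored, and that the fingerprint consists of the exploration rate $\epsilon$ plus a normalised training-iteration counter. All function names, the `clip` parameter, and the `max_iters` normalisation are hypothetical.

```python
from typing import List, Sequence


def others_importance_weight(
    current_probs: Sequence[float],
    stored_probs: Sequence[float],
    clip: float = 10.0,  # hypothetical cap to keep weights bounded
) -> float:
    """Ratio of the other agents' joint action probability now vs. at collection time.

    Each sequence holds one probability per *other* agent for the action that
    agent actually took in the stored transition.
    """
    weight = 1.0
    for p_now, p_then in zip(current_probs, stored_probs):
        weight *= p_now / max(p_then, 1e-8)  # guard against division by zero
    return min(weight, clip)


def weighted_td_loss(td_errors: Sequence[float], weights: Sequence[float]) -> float:
    """Mean squared TD error, with each sample scaled by its importance weight."""
    total = sum(w * d * d for w, d in zip(weights, td_errors))
    return total / len(td_errors)


def add_fingerprint(
    observation: Sequence[float],
    epsilon: float,
    train_iter: int,
    max_iters: int,
) -> List[float]:
    """Append an assumed fingerprint (epsilon, normalised iteration) to the observation."""
    return list(observation) + [epsilon, train_iter / max_iters]


# Transitions whose recorded joint policy differs from the current one are down-weighted.
w = others_importance_weight([0.1, 0.2], [0.5, 0.4])
obs = add_fingerprint([0.3, 0.7], epsilon=0.5, train_iter=250, max_iters=1000)
```

In this sketch, a transition collected under a policy that the other agents no longer follow receives a small weight, so it contributes little to the loss. The fingerprint instead leaves the data unweighted and lets the Q-network distinguish old samples from recent ones through its input.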

Paper Structure

This paper contains 16 sections, 7 equations, 4 figures.

Figures (4)

  • Figure 1: An example of the observations obtained by all agents at each time step $t$. The function $f$ provides a set of features for each unit in the agent's field of view, which are concatenated. The feature set is {distance, relative x, relative y, health points, weapon cooldown}. Each quantity is normalised by its maximum possible value.
  • Figure 2: Performance of our methods, importance sampling (IS) and fingerprints (FP), compared to the two baselines XP (with experience replay) and NOXP (without experience replay), for both recurrent (RNN) and feedforward (FF) networks; (a) and (b) show the 3v3 setting, in which IS and FP are only required with feedforward networks; (c) and (d) show the 5v5 setting, in which FP clearly improves performance over the baselines, while IS shows a small improvement only in the feedforward setting. Overall, FP is the more effective method for resolving the nonstationarity, and there is no additional benefit from combining IS with FP. Confidence intervals show one standard deviation of the sample mean.
  • Figure 3: Estimated value of a single initial observation with different $\epsilon$ in its fingerprint input, at different stages of training. The network learns to smoothly vary its value estimates across different stages of training.
  • Figure 4: (upper) Sampled trajectories of agents, from the beginning (a) and end (b) of training. Each agent is one colour and the starting points are marked as black squares. (lower) Linear regression predictions of $\epsilon$ from the hidden state halfway through each episode in the replay buffer: (c) with only XP, the hidden state still contains disambiguating information drawn from the trajectories, (d) with XP+FP, the hidden state is more informative about the stage of training.
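
The observation construction described in the Figure 1 caption can be sketched as follows. This is only an illustration under stated assumptions: the `Unit` class, the circular field of view of radius `sight_range`, and the choice of `sight_range`, `max_health` and `max_cooldown` as normalising constants are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Unit:
    """Hypothetical container for the raw state of one unit."""
    x: float
    y: float
    health: float
    cooldown: float


def unit_features(
    agent: Unit, other: Unit, sight_range: float, max_health: float, max_cooldown: float
) -> List[float]:
    """Features {distance, relative x, relative y, health, cooldown}, each normalised."""
    dx = other.x - agent.x
    dy = other.y - agent.y
    return [
        math.hypot(dx, dy) / sight_range,
        dx / sight_range,
        dy / sight_range,
        other.health / max_health,
        other.cooldown / max_cooldown,
    ]


def build_observation(
    agent: Unit,
    others: Sequence[Unit],
    sight_range: float,
    max_health: float,
    max_cooldown: float,
) -> List[float]:
    """Concatenate the feature vectors of every unit inside the agent's field of view."""
    obs: List[float] = []
    for other in others:
        if math.hypot(other.x - agent.x, other.y - agent.y) <= sight_range:
            obs.extend(unit_features(agent, other, sight_range, max_health, max_cooldown))
    return obs


agent = Unit(x=0.0, y=0.0, health=100.0, cooldown=0.0)
visible = Unit(x=3.0, y=4.0, health=50.0, cooldown=5.0)
hidden = Unit(x=20.0, y=0.0, health=80.0, cooldown=0.0)  # outside the sight range
observation = build_observation(agent, [visible, hidden], 10.0, 100.0, 10.0)
```

The observation here grows with the number of visible units. A network with a fixed input size would additionally need a fixed slot per unit, for example zero-filled when the unit is out of view, which this sketch leaves out.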