Higher Replay Ratio Empowers Sample-Efficient Multi-Agent Reinforcement Learning

Linjie Xu, Zichuan Liu, Alexander Dockhorn, Diego Perez-Liebana, Jinyu Wang, Lei Song, Jiang Bian

TL;DR

This work tackles the notorious sample inefficiency of multi-agent reinforcement learning with a simple yet effective change: increase the Replay Ratio ($RR$), i.e., perform multiple gradient updates per episode to better exploit the data already collected. The approach is shown to be generally beneficial across three widely used MARL baselines (VDN, QMIX, QPLEX) on six StarCraft II SMAC tasks: $RR$ values of 2 or 4 yield faster convergence and higher final performance, while an excessive $RR$ can cause overfitting in some tasks. The authors examine potential plasticity loss through Dormant Neural Ratio ($DNR$) analysis and show that shared RNNs help maintain network plasticity, making resets unnecessary except under extreme $RR$. They also explore the trade-off between computation and data budget, and compare raising $RR$ against using larger batch sizes and learning rates, finding $RR$ to be the more effective lever for improving sample efficiency in MARL. The work provides open-source code and suggests future directions such as adaptive $RR$ and the exploration of higher $RR$ values.
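The $DNR$ analysis mentioned above can be illustrated with a small sketch. It assumes the commonly used $\tau$-dormant criterion: a neuron's mean absolute activation over a batch is divided by the average of that quantity across its layer, and the neuron counts as dormant when the resulting score is at most $\tau$. The paper's exact definition is not reproduced in this summary and may differ in detail; the function name `dormant_ratio`, the nested-list input layout, and the default `tau=0.1` are illustrative assumptions.

```python
def dormant_ratio(layer_activations, tau=0.1):
    """Fraction of neurons whose normalised activation score is <= tau.

    layer_activations: one entry per layer; each layer is a list of samples,
    and each sample is a list with one activation value per neuron.
    (Hypothetical input layout, chosen only for this illustration.)
    """
    dormant = 0
    total = 0
    for samples in layer_activations:
        num_neurons = len(samples[0])
        # Mean absolute activation of each neuron over the batch.
        mean_abs = [
            sum(abs(sample[i]) for sample in samples) / len(samples)
            for i in range(num_neurons)
        ]
        # Average of those means across the layer, used for normalisation.
        layer_mean = sum(mean_abs) / num_neurons
        for value in mean_abs:
            # Assumption: a completely silent layer counts as all dormant.
            score = value / layer_mean if layer_mean > 0 else 0.0
            if score <= tau:
                dormant += 1
        total += num_neurons
    return dormant / total


if __name__ == "__main__":
    # One layer, two samples, three neurons; the middle neuron never fires.
    layer = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
    print(dormant_ratio([layer], tau=0.1))
```

In practice the activations would be recorded from the agent network on a batch of replayed trajectories; a ratio that rises during training is read as a sign of plasticity loss.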

Abstract

One of the notorious issues for Reinforcement Learning (RL) is poor sample efficiency. Compared with single-agent RL, achieving sample efficiency in Multi-Agent Reinforcement Learning (MARL) is more challenging because of its inherent partial observability, non-stationary training, and enormous strategy space. Although much effort has been devoted to developing new methods and enhancing sample efficiency, we instead examine the widely used episodic training mechanism. In each training step, tens of frames are collected, but only one gradient step is made. We argue that this episodic training could be a source of poor sample efficiency. To better exploit the data already collected, we propose to increase the frequency of gradient updates per environment interaction (a.k.a. Replay Ratio or Update-To-Data ratio). To show its generality, we evaluate $3$ MARL methods on $6$ SMAC tasks. The empirical results validate that a higher replay ratio significantly improves the sample efficiency of MARL algorithms. The code to reproduce the results presented in this paper is open-sourced at https://anonymous.4open.science/r/rr_for_MARL-0D83/.
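The change proposed in the abstract can be sketched as a small modification to the episodic training loop: after each collected episode, take $RR$ gradient steps instead of one, each on a freshly sampled batch. This is a minimal illustration, not the authors' implementation; `ReplayBuffer`, `train`, `collect_episode`, `gradient_step`, and the default batch size and buffer capacity are hypothetical placeholders.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores whole episodes, as in episodic MARL training."""

    def __init__(self, capacity):
        self.episodes = deque(maxlen=capacity)

    def add(self, episode):
        self.episodes.append(episode)

    def sample(self, batch_size):
        # Sample with replacement so early training works with few episodes.
        return [random.choice(self.episodes) for _ in range(batch_size)]


def train(collect_episode, gradient_step, num_episodes, replay_ratio=1,
          batch_size=32, buffer_capacity=5000):
    """Run episodic training with `replay_ratio` gradient steps per episode.

    replay_ratio=1 is the conventional one-update-per-episode loop.
    Returns the total number of gradient steps taken.
    """
    buffer = ReplayBuffer(buffer_capacity)
    updates = 0
    for _ in range(num_episodes):
        buffer.add(collect_episode())      # one environment interaction
        for _ in range(replay_ratio):      # RR updates on buffered data
            gradient_step(buffer.sample(batch_size))
            updates += 1
    return updates


if __name__ == "__main__":
    # Dummy stand-ins: an "episode" is a list of frames, the learner is a no-op.
    total = train(
        collect_episode=lambda: [0] * 50,
        gradient_step=lambda batch: None,
        num_episodes=10,
        replay_ratio=4,
    )
    print(total)  # 10 episodes * 4 updates per episode = 40
```

The data budget (number of episodes) is unchanged; only the compute spent per episode grows linearly with `replay_ratio`, which is the computation-versus-data trade-off the TL;DR refers to.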

Paper Structure (16 sections, 4 equations, 8 figures, 1 algorithm)

Figures (8)

  • Figure 1: Training pipeline for MARL. $\theta_t$ represents the agent parameters at the $t$-th update. For each update, a batch of trajectories is sampled from the replay buffer for the gradient calculation. Left: Conventional MARL training, which performs one backpropagation step for each environmental interaction (one episode). Right: MARL training with $RR=N$, where multiple backpropagation steps are applied for each interaction to better exploit the collected data.
  • Figure 2: The performance of VDN on the MMM2 task under different $RR$ values. The results are plotted with standard errors across 5 random seeds.
  • Figure 3: Comparison of common MARL training (i.e., $RR=1$) and training with a higher $RR$ (the best performance with $RR\in\{2, 4\}$) on $6$ StarCraft II tasks. $3$ MARL methods are evaluated, and their performance after $1$ million and $2$ million environmental interactions is visualized.
  • Figure 4: Evaluation performance at different checkpoints from the VDN training.
  • Figure 6: Evaluation performance at different checkpoints from the QPLEX training.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Definition 3.1