Revisiting Fundamentals of Experience Replay

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney

TL;DR

This study examines how experience replay data generation and learning algorithms interact in deep RL, focusing on two properties: replay capacity and replay ratio. Through large-scale, controlled experiments with Rainbow and DQN on Atari, it finds that increasing replay capacity often boosts performance, particularly when $n$-step returns are used, and that reducing the age of the oldest policy in the buffer generally helps as well. Notably, uncorrected $n$-step returns prove uniquely beneficial for leveraging larger buffers. The authors show that $n$-step returns can mitigate issues from off-policy data and may reduce variance, which explains part of the capacity benefit, and offline batch RL experiments extend these findings to massive data regimes. Overall, the work clarifies which components drive gains from bigger replay buffers and offers practical guidance for designing scalable off-policy RL systems.

Abstract

Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay -- greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.
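To make the term "uncorrected n-step return" concrete, the following is a minimal sketch of the target it refers to: the discounted sum of the next n rewards plus a discounted bootstrap value, with no importance-sampling or other off-policy correction applied. The function name, argument names, and the default discount are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def uncorrected_n_step_target(
    rewards: Sequence[float],
    bootstrap_value: float,
    gamma: float = 0.99,
) -> float:
    """Return sum_k gamma^k * r_k + gamma^n * bootstrap_value.

    `rewards` holds the n rewards observed after the sampled state, and
    `bootstrap_value` stands in for max_a Q(s_{t+n}, a) from a target
    network. No off-policy correction is applied (hence "uncorrected"),
    even though the rewards may come from an older behaviour policy.
    All names here are hypothetical and chosen for illustration only.
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    # Bootstrap from the value estimate n steps ahead.
    target += (gamma ** len(rewards)) * bootstrap_value
    return target
```

With a single reward (n = 1) this reduces to the usual one-step Q-learning target.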

Paper Structure

This paper contains 32 sections, 1 equation, 19 figures, 1 table.

Figures (19)

  • Figure 1: Replay ratio varies with replay capacity and the age of the oldest policy. The replay ratio for controlling different replay capacities (rows) and different ages of the oldest policy (columns). Bold values of 0.25 are the default replay ratio (one gradient update per four actions) used by Mnih et al. (2015).
  • Figure 2: Performance consistently improves with increased replay capacity and generally improves with reducing the age of the oldest policy. Median percentage improvement over the Rainbow baseline when varying the replay capacity and age of the oldest policy in Rainbow on a 14 game subset of Atari. We do not run the two cells in the bottom-right because they are extremely expensive due to the need to collect a large number of transitions per policy.
  • Figure 3: Performance generally improves when trained on data from more recent policies. Training curves for three games, each over a sweep of three values of the oldest-policy age (2.5e5, 2.5e6 and 2.5e7). Performance generally improves significantly as the age of the oldest policy is reduced, except in sparse-reward games such as PrivateEye.
  • Figure 4: Sparse-reward games benefit from data generated by older policies. Median relative improvement of a Rainbow agent with a 10M replay capacity and an oldest-policy age of 250k compared to one with an oldest-policy age of 2.5M. Decreasing the age of the oldest policy improves performance on most games. However, performance drops significantly on the two hard exploration games, which bucks the trend that data from newer policies is better.
  • Figure 5: Adding $n$-step to DQN enables improvements with larger replay capacities. Median relative improvement of DQN additive variants when increasing replay capacity from 1M to 10M. Bars represent the 50th percentile (median) improvement, and the lower and upper bounds of the error line denote the 25th and 75th percentiles, respectively.
  • ...and 14 more figures
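
The Figure 1 caption ties the replay ratio to the replay capacity and the age of the oldest policy. One way to read that relationship is sketched below. It rests on two assumptions that this listing does not confirm: the buffer is first-in-first-out, and the age of the oldest policy is measured in gradient updates. Under those assumptions, a full buffer's oldest transition was collected `replay_capacity` environment steps ago, so the replay ratio is the oldest-policy age divided by the capacity. The function names are hypothetical.

```python
def replay_ratio(replay_capacity: int, oldest_policy_age: int) -> float:
    """Gradient updates per environment transition.

    Assumes a FIFO buffer and an oldest-policy age counted in gradient
    updates (an assumption of this sketch, not a quoted definition).
    """
    if replay_capacity <= 0:
        raise ValueError("replay_capacity must be positive")
    return oldest_policy_age / replay_capacity


def ratio_grid(capacities, ages):
    """Build a capacity-by-age table of replay ratios, mirroring the
    rows (capacities) and columns (oldest-policy ages) of Figure 1."""
    return {c: {a: replay_ratio(c, a) for a in ages} for c in capacities}


# Example values: a 1M-transition buffer with a 250k oldest-policy age gives
# the default ratio of 0.25 (one gradient update per four actions).
grid = ratio_grid([1_000_000, 10_000_000], [250_000, 2_500_000])
```

Under this reading, growing the buffer from 1M to 10M while keeping the oldest-policy age at 250k lowers the replay ratio to 0.025, whereas holding the ratio at 0.25 raises the oldest-policy age to 2.5M.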

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3