Revisiting Fundamentals of Experience Replay
William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney
TL;DR
This study examines how experience replay interacts with the learning algorithm in deep RL, focusing on two properties: replay capacity and replay ratio. In large-scale, controlled experiments with Rainbow and DQN on Atari, increasing replay capacity often improves performance, particularly when the agent uses $n$-step returns; the age of the oldest policy in the buffer also matters. Notably, uncorrected $n$-step returns are uniquely beneficial for exploiting larger buffers. The authors show that $n$-step returns can mitigate issues arising from off-policy data and that variance effects explain part of the capacity benefit, with offline batch RL experiments extending these findings to very large data regimes. Overall, the work clarifies which components drive the gains from larger replay buffers and offers practical guidance for designing scalable off-policy RL systems.
Abstract
Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay -- greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively, we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial, while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio, we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.
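The three quantities named above can be made concrete with a minimal Python sketch. It is illustrative only and is not the authors' implementation: the names `ReplayBuffer`, `replay_ratio`, and `uncorrected_n_step_target` are hypothetical, the buffer is assumed to be a simple first-in-first-out queue, and the bootstrap value is assumed to be supplied by the caller (for example, $\max_a Q(s_{t+n}, a)$ from a target network).

```python
from collections import deque


class ReplayBuffer:
    """Hypothetical FIFO buffer; `capacity` plays the role of replay capacity."""

    def __init__(self, capacity):
        # A deque with maxlen evicts the oldest transition once it is full.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def __len__(self):
        return len(self.storage)


def replay_ratio(num_gradient_updates, num_env_transitions):
    """Learning updates per environment transition collected."""
    return num_gradient_updates / num_env_transitions


def uncorrected_n_step_target(rewards, bootstrap_value, gamma):
    """n-step return r_t + gamma*r_{t+1} + ... + gamma^n * bootstrap_value.

    "Uncorrected" means no importance weights or other off-policy
    correction are applied to the n rewards drawn from the buffer.
    """
    target = bootstrap_value
    # Fold rewards in from the last step back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target
```

For instance, one gradient update for every four environment transitions gives `replay_ratio(1, 4) == 0.25`, and a buffer built with `ReplayBuffer(capacity=3)` holds only the three most recent transitions no matter how many are added.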
