Is Deep Reinforcement Learning Really Superhuman on Atari? Leveling the playing field
Marin Toromanoff, Emilie Wirbel, Fabien Moutarde
TL;DR
This paper addresses reproducibility problems in Deep Reinforcement Learning (DRL) for Atari games by introducing SABER, a standardized benchmark with explicit environment parameters and evaluation protocols. It evaluates the state-of-the-art agent Rainbow and proposes Rainbow-IQN. It shows that, under standardized conditions, current DRL methods still lag behind human world-record performance, especially when unrestricted play time is allowed. The study analyzes three factors that limit performance on Atari: reward clipping, exploration, and human priors. It also demonstrates that the choice of evaluation settings significantly affects reported results, and it urges broader adoption of SABER so that progress can be tracked fairly. Overall, SABER provides a concrete framework for comparing algorithms and for diagnosing the gap between artificial and human-level play on Atari, and Rainbow-IQN represents a meaningful step forward within it.
Abstract
Consistent and reproducible evaluation of Deep Reinforcement Learning (DRL) is not straightforward. In the Arcade Learning Environment (ALE), small changes in environment parameters such as stochasticity or the maximum allowed play time can lead to very different performance. In this work, we discuss the difficulties of comparing different agents trained on ALE. To take a step further towards reproducible and comparable DRL, we introduce SABER, a Standardized Atari BEnchmark for general Reinforcement learning algorithms. Our methodology extends previous recommendations and contains a complete set of environment parameters as well as training and test procedures. We then use SABER to evaluate the current state of the art, Rainbow. Furthermore, we introduce a human world-record baseline and argue that previous claims of expert or superhuman performance of DRL might not be accurate. Finally, we propose Rainbow-IQN, which extends Rainbow with Implicit Quantile Networks (IQN) and achieves new state-of-the-art performance. Source code is available for reproducibility.
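To make the two environment parameters named above concrete, here is a minimal sketch of how stochasticity and a play-time cap might wrap an emulator. Stochasticity is modeled as "sticky actions" (Machado et al., 2018): with probability `sticky_prob`, the previously executed action is repeated instead of the agent's choice. ALE itself exposes this through its `repeat_action_probability` setting. Everything else is an assumption for illustration. `ToyEnv`, `StickyTimeLimitEnv`, `run_episode`, the `(obs, reward, done)` step interface, and action 0 as NOOP are hypothetical. For simplicity, each step is treated as one emulator frame (no frame skipping). This sketch does not reproduce SABER's actual parameter values or code.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for an Atari emulator.

    Gives reward 1.0 per frame, never terminates on its own, and records
    which actions were actually executed so they can be inspected.
    """

    def __init__(self):
        self.executed = []

    def reset(self):
        self.executed = []
        return 0  # dummy observation

    def step(self, action):
        self.executed.append(action)
        return len(self.executed), 1.0, False  # (obs, reward, done)


class StickyTimeLimitEnv:
    """Hypothetical wrapper adding sticky actions and a maximum episode length."""

    def __init__(self, env, sticky_prob=0.25, max_frames=108_000, seed=None):
        self.env = env
        self.sticky_prob = sticky_prob  # probability of repeating the last executed action
        self.max_frames = max_frames    # play-time cap, counted in frames
        self.rng = random.Random(seed)
        self.prev_action = 0            # assume action 0 is NOOP
        self.frames = 0

    def reset(self):
        self.prev_action = 0
        self.frames = 0
        return self.env.reset()

    def step(self, action):
        # Sticky actions: ignore the agent's choice with probability sticky_prob
        # and execute the previously executed action instead.
        if self.rng.random() < self.sticky_prob:
            action = self.prev_action
        self.prev_action = action
        obs, reward, done = self.env.step(action)
        self.frames += 1
        # The episode is truncated once the frame budget is exhausted.
        truncated = self.frames >= self.max_frames
        return obs, reward, done, truncated


def run_episode(env, policy):
    """Play one episode with a zero-argument policy and return the total reward."""
    env.reset()
    total, done, truncated = 0.0, False, False
    while not (done or truncated):
        _, reward, done, truncated = env.step(policy())
        total += reward
    return total
```

In this toy environment, the return equals the number of frames played. Doubling `max_frames` therefore doubles the score, which is a deliberately exaggerated illustration of why the play-time cap has to be reported alongside results. As a point of reference for the default, 108,000 frames at 60 frames per second corresponds to 30 minutes of play. Setting `sticky_prob=1.0` makes every executed action the initial NOOP, which shows how strongly the stochasticity parameter can change what the agent actually does.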
