Is Deep Reinforcement Learning Really Superhuman on Atari? Leveling the playing field

Marin Toromanoff, Emilie Wirbel, Fabien Moutarde

TL;DR

This paper addresses reproducibility problems in Deep Reinforcement Learning for Atari games by introducing SABER, a standardized benchmark with explicit environment parameters and evaluation protocols. It evaluates the state-of-the-art Rainbow and proposes Rainbow-IQN, showing that, under standardized conditions, current DRL methods still lag behind human world-record performance, especially when unrestricted play time is allowed. The study analyzes factors limiting Atari performance—reward clipping, exploration, and human priors—and demonstrates that standardized evaluation significantly affects reported results, urging broader adoption of SABER for fair progress tracking. Overall, SABER provides a concrete framework to compare algorithms and diagnose gaps between artificial and human-level play in Atari, with Rainbow-IQN representing a meaningful step forward within this framework.

Abstract

Consistent and reproducible evaluation of Deep Reinforcement Learning (DRL) is not straightforward. In the Arcade Learning Environment (ALE), small changes in environment parameters such as stochasticity or the maximum allowed play time can lead to very different performance. In this work, we discuss the difficulties of comparing different agents trained on ALE. In order to take a step further towards reproducible and comparable DRL, we introduce SABER, a Standardized Atari BEnchmark for general Reinforcement learning algorithms. Our methodology extends previous recommendations and contains a complete set of environment parameters as well as train and test procedures. We then use SABER to evaluate the current state of the art, Rainbow. Furthermore, we introduce a human world records baseline, and argue that previous claims of expert or superhuman performance of DRL might not be accurate. Finally, we propose Rainbow-IQN by extending Rainbow with Implicit Quantile Networks (IQN) leading to new state-of-the-art performance. Source code is available for reproducibility.

Paper Structure

This paper contains 47 sections, 2 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: ALE Space Invaders
  • Figure 2: World record scores vs. the usual beginner human baseline (Mnih et al., 2015), log scale.
  • Figure 3: Comparison of Rainbow and Rainbow-IQN on SABER: median normalized scores with respect to training steps.
  • Figure 4: Comparison of Rainbow and Rainbow-IQN on SABER: classifying performance of agents relative to the records baseline (at 200M training frames).
  • Figure 5: Median performance comparison for DQN, Rainbow and Rainbow-IQN with respect to training frames. Evaluation time is set at 5 minutes to allow a comparison to DQN.
  • ...and 6 more figures
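
Several captions above refer to median normalized scores against a baseline. The sketch below shows the usual convention for such scores on Atari, in which random play maps to 0 and the baseline (beginner human or world record) maps to 1. It is only an illustration: this summary does not state the exact formula, so the normalization is an assumption, the function names are hypothetical, and the game names and numbers are made up rather than taken from the paper.

```python
from statistics import median


def normalized_score(agent: float, random: float, baseline: float) -> float:
    """Rescale a raw score so that random play is 0 and the baseline is 1.

    Assumed convention: (agent - random) / (baseline - random).
    """
    if baseline == random:
        raise ValueError("baseline and random scores must differ")
    return (agent - random) / (baseline - random)


def median_normalized(scores: dict[str, tuple[float, float, float]]) -> float:
    """Median over games of the normalized score.

    `scores` maps a game name to (agent, random, baseline) raw scores.
    """
    return median(normalized_score(*triple) for triple in scores.values())


# Made-up raw scores, for illustration only (not results from the paper).
example = {
    "game_a": (500.0, 100.0, 1100.0),
    "game_b": (30.0, 10.0, 50.0),
    "game_c": (900.0, 0.0, 1000.0),
}

if __name__ == "__main__":
    for game, triple in example.items():
        print(game, normalized_score(*triple))
    print("median:", median_normalized(example))
```

Swapping the baseline column from a beginner human score to a world record leaves the agent's raw score unchanged but lowers its normalized score. This is why the choice of baseline matters when claiming superhuman performance.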