Open RL Benchmark: Comprehensive Tracked Experiments for Reinforcement Learning

Shengyi Huang, Quentin Gallouédec, Florian Felten, Antonin Raffin, Rousslan Fernand Julien Dossa, Yanxiao Zhao, Ryan Sullivan, Viktor Makoviychuk, Denys Makoviichuk, Mohamad H. Danesh, Cyril Roumégous, Jiayi Weng, Chufan Chen, Md Masudur Rahman, João G. M. Araújo, Guorui Quan, Daniel Tan, Timo Klein, Rujikorn Charakorn, Mark Towers, Yann Berthelot, Kinal Mehta, Dipam Chakraborty, Arjun KG, Valentin Charraut, Chang Ye, Zichen Liu, Lucas N. Alegre, Alexander Nikulin, Xiao Hu, Tianlin Liu, Jongwook Choi, Brent Yi

TL;DR

This work addresses the reproducibility crisis in reinforcement learning by introducing Open RL Benchmark, a community-driven, fully tracked repository of RL experiments with fixed dependencies and an accompanying CLI for easy data access and visualization. It aggregates libraries, environments, and a rich set of metrics into a reproducible framework, enabling precise reproduction via exact commands and seeds. The paper demonstrates practical utility through case studies on PPO with TD($\lambda$) value estimation and the evaluation of Cleanba, underscoring improved reproducibility and deeper insights beyond traditional paper curves. While offering a scalable, collaborative resource, it also discusses ongoing challenges in usability, standardization of evaluation practices, and long-term maintainability, outlining a path toward elevated reproducibility standards in RL research.

Abstract

In many Reinforcement Learning (RL) papers, learning curves are useful indicators to measure the effectiveness of RL algorithms. However, the complete raw data of the learning curves are rarely available. As a result, it is usually necessary to reproduce the experiments from scratch, which can be time-consuming and error-prone. We present Open RL Benchmark, a set of fully tracked RL experiments, including not only the usual data such as episodic return, but also all algorithm-specific and system metrics. Open RL Benchmark is community-driven: anyone can download, use, and contribute to the data. At the time of writing, more than 25,000 runs have been tracked, for a cumulative duration of more than 8 years. Open RL Benchmark covers a wide range of RL libraries and reference implementations. Special care is taken to ensure that each experiment is precisely reproducible by providing not only the full parameters, but also the versions of the dependencies used to generate it. In addition, Open RL Benchmark comes with a command-line interface (CLI) for easy fetching and generating figures to present the results. In this document, we include two case studies to demonstrate the usefulness of Open RL Benchmark in practice. To the best of our knowledge, Open RL Benchmark is the first RL benchmark of its kind, and the authors hope that it will improve and facilitate the work of researchers in the field.

Paper Structure (28 sections, 2 equations, 20 figures)

This paper contains 28 sections, 2 equations, 20 figures.

Figures (20)

  • Figure 1: Example of learning curves obtained with Open RL Benchmark. These compare the episodic returns achieved by different implementations of PPO and DQN on a number of Atari games.
  • Figure 2: An example report on the Weights and Biases platform that examines the contribution of QDagger (Agarwal et al., 2022) using data from Open RL Benchmark. The report is available at https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/Atari-CleanRL-s-Qdagger--Vmlldzo0NTg1ODY5
  • Figure 3: CleanRL's reproduce module lets the user generate, from an Open RL Benchmark run reference, the exact sequence of commands needed to reproduce the run identically.
  • Figure 4: Comparison of the original PPO with a PPO variant that uses Monte-Carlo (MC) value estimation. The experiments cover 15 environments spanning Atari games, Box2D, and MuJoCo. The plot shows min-max normalized scores with 95% stratified bootstrap CIs.
  • Figure 5: Study of the contribution of GAE to estimating the value targets used to update the critic in PPO, compared against a variant that uses the MC estimator instead. The figures show aggregated min-max normalized scores with 95% stratified bootstrap CIs.
  • ...and 15 more figures
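
The captions of Figures 4 and 5 report min-max normalized scores with 95% stratified bootstrap confidence intervals. The sketch below illustrates one common way to compute such an interval. It is not taken from the paper: the function names, the choice of the mean as the aggregate, the use of per-environment minimum and maximum returns as normalization bounds, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def min_max_normalize(scores, low, high):
    """Rescale raw returns per environment: (x - low) / (high - low)."""
    return (scores - low) / (high - low)


def stratified_bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean score, resampling runs within each environment.

    scores: array of shape (n_runs, n_envs); each column is one stratum.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = np.random.default_rng(seed)
    n_runs, n_envs = scores.shape
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Draw run indices with replacement, independently for every environment.
        idx = rng.integers(0, n_runs, size=(n_runs, n_envs))
        resampled = np.take_along_axis(scores, idx, axis=0)
        estimates[b] = resampled.mean()
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lower, upper


# Synthetic final returns (hypothetical): 5 runs on 3 environments with
# very different return scales, which is why normalization is needed.
data_rng = np.random.default_rng(42)
raw = data_rng.normal(loc=[100.0, 20.0, 3000.0],
                      scale=[10.0, 5.0, 300.0],
                      size=(5, 3))

# Assumption: normalization bounds are the per-environment min and max
# observed across all runs.
low, high = raw.min(axis=0), raw.max(axis=0)
norm = min_max_normalize(raw, low, high)

point, ci_low, ci_high = stratified_bootstrap_ci(norm)
print(f"mean normalized score: {point:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling within each environment separately (the stratification) keeps every environment represented equally in each bootstrap replicate, so the interval reflects run-to-run variability rather than which environments happened to be drawn.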