ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning

Jannis Becktepe; Julian Dierkes; Carolin Benjamins; Aditya Mohan; David Salinas; Raghu Rajan; Frank Hutter; Holger Hoos; Marius Lindauer; Theresa Eimer

TL;DR

ARLBench addresses the high cost and poor comparability of hyperparameter optimization (HPO) in reinforcement learning (RL) by introducing a flexible benchmark and a large hyperparameter landscape dataset. It provides a Gym-like AutoRL Environment, efficient JAX-based RL implementations (PPO, DQN, SAC), and a principled subset selection method that picks a small set of environments representative of the full RL task space. The work demonstrates substantial running time reductions while preserving the relative performance of HPO methods, and it releases extensive data and tooling to support reproducibility and future surrogate modeling or neural architecture search (NAS) integration. Overall, ARLBench lowers the compute barrier for AutoRL research and offers a scalable, extensible platform for robust HPO evaluation in RL.

Abstract

Hyperparameters are a critical factor in reliably training well-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated approaches for tuning such hyperparameters is both costly and time-consuming. As a result, such approaches are often only evaluated on a single domain or algorithm, making comparisons difficult and limiting insights into their generalizability. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows comparisons of diverse HPO approaches while being highly efficient in evaluation. To enable research into HPO in RL, even in settings with low compute resources, we select a representative subset of HPO tasks spanning a variety of algorithm and environment combinations. This selection allows for generating a performance profile of an automated RL (AutoRL) method using only a fraction of the compute previously necessary, enabling a broader range of researchers to work on HPO in RL. With the extensive and large-scale dataset on hyperparameter landscapes that our selection is based on, ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL. Both the benchmark and the dataset are available at https://github.com/automl/arlbench.

Paper Structure (33 sections, 3 equations, 49 figures, 16 tables)

Introduction
Related Work: Benchmarking HPO for RL
Implementing ARLBench
Benchmark Desiderata for ARLBench
The HPO Interface: The AutoRL Environment
RL Training
Finding Representative Benchmarking Settings
Data Collection
Subset Selection
Validating ARLBench
Limitations and Future Work
Conclusion
Dataset Description
Reproducing Our Results
Execution Environment
...and 18 more sections

Figures (49)

Figure 1: Running time of an HPO method performing $32$ RL runs with $10$ seeds each, compared between ARLBench and StableBaselines3 (SB3) (Raffin et al., 2021) on the full environment set and on our subsets. On the full set, the JAX-based implementation makes ARLBench faster than SB3 by factors of 3.59 for PPO, 2.87 for DQN, and 5.78 for SAC. The subset selection further decreases the running time by a factor of 2.67 for PPO, 2.49 for DQN, and 2.0 for SAC. Comparing ARLBench on the subset to SB3 on the full set, the total speedups are 9.6 for PPO, 7.14 for DQN, and 11.61 for SAC. Running time comparisons for each environment category can be found in Appendix \ref{app:performance_comp}. Note that the bars for some domains, especially for ARLBench, may be very small due to low running times.
Figure 2: Overview of the ARLBench framework. The AutoRL environment, providing a Gymnasium-like interface (Towers et al., 2023), is the interaction point for HPO methods. At optimization step $t$, the optimizer selects a hyperparameter configuration $\lambda_t$ and a training budget (number of steps) $b_t$. Then, the RL algorithm is trained using the given configuration and budget. As a result, the AutoRL environment returns the training result in the form of optimization objectives $o_t$, e.g., the evaluation return and runtime, and state features $x_t$, e.g., gradients during training.
Figure 3: Comparison of the Spearman correlation for different subset sizes with confidence intervals from 5-fold cross-validation on the configurations.
Figure 4: Selected set of representative environments per algorithm. For PPO, the discrete variant of LunarLander was selected.
Figure 5: Comparison of the return distributions over hyperparameter configurations of PPO on all 21 environments (left) and the selected subset of 5 environments (right). For the same comparisons for DQN and SAC, see Appendix \ref{app:perf_dists}.
...and 44 more figures
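
The interaction loop described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ARLBench's actual API: `ToyAutoRLEnv`, `StepResult`, `random_search`, the objective and feature names, and the synthetic scoring function are all hypothetical, and the RL training run is replaced by a closed-form placeholder. The optimizer used here is plain random search, chosen only because it is the simplest HPO method to show.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class StepResult:
    objectives: dict      # o_t, e.g. evaluation return and runtime
    state_features: dict  # x_t, e.g. gradient statistics from training


class ToyAutoRLEnv:
    """Hypothetical stand-in for a Gymnasium-like AutoRL environment."""

    def step(self, config: dict, budget: int) -> StepResult:
        lr = config["learning_rate"]
        # Placeholder for an RL training run: a synthetic score that is
        # highest (zero) at lr = 1e-3 and decreases away from it.
        eval_return = -abs(math.log10(lr) + 3.0)
        objectives = {
            "eval_return": eval_return,
            "runtime": budget * 1e-4,  # assumed cost per training step
        }
        state_features = {"grad_norm": 100.0 * lr}  # made-up feature
        return StepResult(objectives, state_features)


def random_search(env, n_steps: int = 8, budget: int = 10_000, seed: int = 0):
    """At each step t, pick (lambda_t, b_t), call the env, keep the best."""
    rng = random.Random(seed)
    best_config, best_return = None, float("-inf")
    for _ in range(n_steps):
        # lambda_t: learning rate sampled log-uniformly from [1e-5, 1e-1]
        config = {"learning_rate": 10 ** rng.uniform(-5.0, -1.0)}
        result = env.step(config, budget)  # b_t is held fixed here
        score = result.objectives["eval_return"]
        if score > best_return:
            best_config, best_return = config, score
    return best_config, best_return


if __name__ == "__main__":
    config, score = random_search(ToyAutoRLEnv())
    print(config, score)
```

A dynamic HPO method would additionally read `state_features` after each step and use them to adapt the next configuration or budget; the sketch ignores them to stay short.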