The Cross-environment Hyperparameter Setting Benchmark for Reinforcement Learning

Andrew Patterson; Samuel Neumann; Raksha Kumaraswamy; Martha White; Adam White

TL;DR

The paper addresses how to evaluate reinforcement learning algorithms reliably across diverse environments without extensive per-environment hyperparameter tuning. It proposes the Cross-environment Hyperparameter Setting Benchmark (CHS), a four-step framework that uses a small preliminary sweep, cross-environment score normalization via $N_E(G)=\text{CDF}(G,E)$, and a single cross-environment hyperparameter choice $\theta_{CHS}$, followed by a thorough re-evaluation. Through SC-CHS and a large-scale DMC-CHS demonstration, the authors show that CHS yields a stable ordering of algorithms even when few runs are used for tuning, while also revealing that many algorithms struggle to generalize across environments. In the DM Control study, they find no meaningful difference between Ornstein-Uhlenbeck noise and uncorrelated Gaussian noise for exploration with DDPG across 28 environments. The work argues that CHS is a practical, low-cost, and reproducible benchmark for advancing generality and reliability in RL, and that it can help resolve empirical disputes by focusing on cross-environment performance instead of environment-specific tuning.
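The selection step summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `sweep[env][theta]` data layout, the use of an empirical CDF over the returns pooled within each environment, and the mean as the cross-environment aggregate are all hypothetical choices made for the example.

```python
import numpy as np


def empirical_cdf_normalizer(pooled_returns):
    """Build N_E for one environment from its pooled preliminary returns.

    Assumption: N_E(G) is the empirical CDF, i.e. the fraction of pooled
    returns in this environment that are less than or equal to G.
    """
    sorted_returns = np.sort(np.asarray(pooled_returns, dtype=float))

    def normalize(g):
        # side="right" counts how many pooled returns are <= g.
        return np.searchsorted(sorted_returns, g, side="right") / sorted_returns.size

    return normalize


def select_chs_setting(sweep):
    """Pick one hyperparameter setting for all environments.

    sweep[env][theta] holds the returns of the preliminary runs for
    hyperparameter setting `theta` in environment `env` (hypothetical layout).
    Returns the chosen setting and the mean normalized score per setting.
    """
    thetas = sorted(next(iter(sweep.values())).keys())
    per_env_scores = {theta: [] for theta in thetas}

    for by_theta in sweep.values():
        # Pool every preliminary return observed in this environment.
        pooled = np.concatenate(
            [np.asarray(by_theta[theta], dtype=float) for theta in thetas]
        )
        normalize = empirical_cdf_normalizer(pooled)
        for theta in thetas:
            runs = np.asarray(by_theta[theta], dtype=float)
            per_env_scores[theta].append(float(np.mean(normalize(runs))))

    # Assumption: aggregate across environments with a plain mean.
    mean_scores = {theta: float(np.mean(s)) for theta, s in per_env_scores.items()}
    best_theta = max(thetas, key=lambda theta: mean_scores[theta])
    return best_theta, mean_scores
```

Because each environment's returns are mapped to the range 0 to 1 before averaging, an environment with large raw returns cannot dominate the choice. In a full application, the selected setting would then be re-evaluated with many fresh runs in every environment, and only those results reported.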

Abstract

This paper introduces a new empirical methodology, the Cross-environment Hyperparameter Setting Benchmark, that compares RL algorithms across environments using a single hyperparameter setting, encouraging algorithmic development which is insensitive to hyperparameters. We demonstrate that this benchmark is robust to statistical noise and obtains qualitatively similar results across repeated applications, even when using few samples. This robustness makes the benchmark computationally cheap to apply, allowing statistically sound insights at low cost. We demonstrate two example instantiations of the CHS, on a set of six small control environments (SC-CHS) and on the entire DM Control suite of 28 environments (DMC-CHS). Finally, to illustrate the applicability of the CHS to modern RL algorithms on challenging environments, we conduct a novel empirical study of an open question in the continuous control literature. We show, with high confidence, that there is no meaningful difference in performance between Ornstein-Uhlenbeck noise and uncorrelated Gaussian noise for exploration with the DDPG algorithm on the DMC-CHS.

Paper Structure (17 sections, 1 equation, 18 figures)

Introduction
Contrasting Across-Environment versus Per-Environment Tuning
Performance Distributions
The Cross-environment Hyperparameter Setting Benchmark
Evaluating the Cross-environment Hyperparameter Setting Benchmark
A Demonstrative Example of Using the CHS
Conclusion
Ethical considerations
Additional Results
Distribution of selected hyperparameters.
Tuning on a subset
Performance distributions
Results when using worst-case performance across environments
DMControl demonstration
Further Experimental Details
...and 2 more sections

Figures (18)

Figure 1: Chance of incorrect claims
Figure 2: An example experiment comparing four algorithms across six different environments. Each learning curve shows the mean and 95% confidence interval of 250 independent runs for each algorithm and environment. Hyperparameters are selected using three runs of every algorithm, environment, and hyperparameter setting. Top shows the learning curves when the best hyperparameters are chosen for each environment individually. Bottom shows the learning curves when hyperparameters are chosen according to the CHS.
Figure 3:
Figure 4: Applying the CHS to 10k simulated experiments. Error bars show 95% bootstrap confidence intervals. Although only three runs were used to select hyperparameters, conclusions about algorithm ranking using the CHS are perfectly consistent across all 10k experiments.
Figure 5: The change in performance for each algorithm on every environment when using the CHS versus conventional per-environment tuning. A larger drop in performance indicates a larger degree of environment overfitting when results are reported with per-environment tuning. Error bars show 95% confidence intervals over 10k bootstrap samples.
...and 13 more figures
