
On the consistency of hyper-parameter selection in value-based deep reinforcement learning

Johan Obando-Ceron, João G. M. Araújo, Aaron Courville, Pablo Samuel Castro

TL;DR

The paper tackles the reliability of hyper-parameter selection in value-based deep reinforcement learning by introducing the THC score, a ranking-based consistency metric across training regimes. It conducts a large-scale empirical study spanning 12 hyper-parameters, two agents (DER and DrQ($\epsilon$)), 26 Atari environments, and two data regimes ($100$k agent steps and $40$M environment frames), totaling about $108$k training runs, complemented by a web-based appendix. The findings show limited transferability across data regimes and environments, while some hyper-parameter settings transfer reasonably across agents; $\epsilon$ and update-period-like hyper-parameters are particularly sensitive to tuning. The work offers a practical framework for robust hyper-parameter selection and interpretable exploration of transferability, highlighting the need for environment-aware or dynamic tuning in real-world deep RL deployments.
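The paper defines the THC score precisely in its methods section; the sketch below is only an illustration of what a ranking-based consistency metric can look like. It scores how well the ranking of hyper-parameter values agrees across training settings using average pairwise Kendall's $\tau$; the function name, input layout, and choice of Kendall's $\tau$ are assumptions for illustration, not the paper's exact THC definition.

```python
# Illustrative sketch only: approximates "ranking-based consistency"
# with average pairwise Kendall's tau (an assumption; the paper's THC
# score is defined in its methods section and may differ).
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau


def ranking_consistency(perf):
    """perf: (n_settings, n_values) array of final performance of each
    hyper-parameter value under each training setting (e.g., the two
    data regimes, or the 26 games). Returns the mean pairwise Kendall
    tau between the rankings each setting induces over the values:
    1.0 = identical rankings everywhere, -1.0 = fully reversed."""
    perf = np.asarray(perf, dtype=float)
    taus = []
    for a, b in combinations(perf, 2):
        tau, _ = kendalltau(a, b)
        taus.append(tau)
    return float(np.mean(taus))


# Example: four values of one hyper-parameter evaluated in two regimes.
# The top-2 values swap order between regimes, so consistency < 1.
regime_100k = [3.0, 2.5, 1.0, 0.5]
regime_40m = [2.8, 3.1, 1.2, 0.4]
print(ranking_consistency([regime_100k, regime_40m]))  # ~0.67
```

A rank-based measure like this is insensitive to the scale of returns, which vary by orders of magnitude across Atari games; only the induced ordering of hyper-parameter values matters.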

Abstract

Deep reinforcement learning (deep RL) has achieved tremendous success on various domains through a combination of algorithmic design and careful selection of hyper-parameters. Algorithmic improvements are often the result of iterative enhancements built upon prior approaches, while hyper-parameter choices are typically inherited from previous methods or fine-tuned specifically for the proposed technique. Despite their crucial impact on performance, hyper-parameter choices are frequently overshadowed by algorithmic advancements. This paper conducts an extensive empirical study focusing on the reliability of hyper-parameter selection for value-based deep reinforcement learning agents, including the introduction of a new score to quantify the consistency and reliability of various hyper-parameters. Our findings not only help establish which hyper-parameters are most critical to tune, but also help clarify which tunings remain consistent across different training regimes.

Paper Structure

This paper contains 20 sections, 3 equations, 7 figures, 2 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Tuning Hyper-parameter Consistency (THC score, see \ref{sec:thc_metric}) evaluated across agents (left panel), data regimes (center panel), and environments (right panel). Different colors indicate different data regimes (left panel) and different agents (center and right panels); grey bars/titles indicate hyper-parameters that are not comparable across the considered transfer settings.
  • Figure 2: Measured IQM of human-normalized scores on the $26$ $100$k benchmark games, with varying Adam's $\epsilon$ for DER. We evaluate performance at $100$k agent steps (or $400$k environment frames), and at $40$ million environment frames. The ordering of the best hyper-parameters switches between the two data regimes. (A sketch of the IQM aggregate follows this list.)
  • Figure 3: Measured returns with varying batch size for DrQ($\epsilon$) (top) and DER (bottom) at $40$M environment frames for four representative games, demonstrating that the ranking of the hyper-parameter values can drastically change from one game to the next. All results are averaged over $5$ seeds; shaded areas represent $95\%$ confidence intervals.
  • Figure 4: Measured returns with various hyper-parameter variations on Asterix for DrQ($\epsilon$) (top) and DER (bottom) at 40M environment frames. Eight representative hyper-parameters are displayed, enabling per-game analyses for hyper-parameter selection.
  • Figure 5: Measured IQM of human-normalized scores on the 26 100k benchmark games, with varying Weight Decay for DER. We evaluate performance at 100k agent steps (or 400k environment frames), and at 40 million environment frames. At 40 million frames, 0.1 is on average optimal, with 0.5 in second place and the standard value of 0.0 in fourth.
  • ...and 2 more figures
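
Several of the captions above report the IQM (interquartile mean) of human-normalized scores. As a quick reference, here is a minimal sketch assuming the standard Atari normalization; results like these are commonly computed with the rliable library, which additionally provides stratified-bootstrap confidence intervals, so this trimmed mean is only an approximation of that pipeline.

```python
import numpy as np


def human_normalized(score, random_score, human_score):
    """Standard Atari normalization: 0 = random play, 1 = human level."""
    return (score - random_score) / (human_score - random_score)


def iqm(scores):
    """Interquartile mean: the average of the middle 50% of values,
    discarding the bottom and top quartiles. More robust to outlier
    games than the mean, more statistically efficient than the median."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    trim = len(x) // 4  # drop ~25% from each tail
    return float(x[trim : len(x) - trim].mean())


# Example: human-normalized scores across a handful of games. The
# outlier score of 5.0 is discarded by the trimming, so it does not
# dominate the aggregate the way it would with a plain mean.
print(iqm([0.1, 0.4, 0.5, 0.6, 0.7, 0.9, 1.2, 5.0]))  # 0.675
```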