AutoRL Hyperparameter Landscapes

Aditya Mohan, Carolin Benjamins, Konrad Wienecke, Alexander Dockhorn, Marius Lindauer

TL;DR

This work introduces a pipeline to study dynamic hyperparameter landscapes in AutoRL by collecting performance data at multiple training phases and building landscape surrogates. It characterizes how hyperparameter effects shift over time using ILM and IGPR models, and evaluates unimodality of return distributions to assess stability. Through experiments with DQN, PPO, and SAC on CartPole, BipedalWalker, and Hopper, the paper reveals strong temporal variation in optimal hyperparameters, supporting the case for dynamic AutoRL strategies. The findings illuminate the non-stationary nature of RL optimization and provide tools and evidence to guide the design of future AutoRL methods and landscape analyses, with code available for replication at the authors' repository.

Abstract

Although Reinforcement Learning (RL) has been shown to be capable of producing impressive results, its use is limited by the impact of its hyperparameters on performance. This often makes it difficult to achieve good results in practice. Automated RL (AutoRL) addresses this difficulty, yet little is known about the dynamics of the hyperparameter landscapes that hyperparameter optimization (HPO) methods traverse in search of optimal configurations. In view of existing AutoRL approaches dynamically adjusting hyperparameter configurations, we propose an approach to build and analyze these hyperparameter landscapes not just for one point in time but at multiple points in time throughout training. Addressing an important open question on the legitimacy of such dynamic AutoRL approaches, we provide thorough empirical evidence that the hyperparameter landscapes strongly vary over time across representative algorithms from RL literature (DQN, PPO, and SAC) in different kinds of environments (Cartpole, Bipedal Walker, and Hopper). This supports the theory that hyperparameters should be dynamically adjusted during training and shows the potential for more insights on AutoRL problems that can be gained through landscape analyses. Our code can be found at https://github.com/automl/AutoRL-Landscape

Paper Structure (39 sections, 3 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: An overview of our hyperparameter landscape creation and analysis pipeline. With an RL algorithm, environment, and the hyperparameter search space, we collect performance data for hyperparameters covering the search space at multiple time steps throughout training (\ref{sec:Method:data-collection}). The gathered data relates algorithm performance to the algorithm configuration, which we use for modeling the landscapes (\ref{sec:Method:landscape}).
  • Figure 2: Overview of the data collection process for landscapes with three configurations $\bm{\lambda}$ and three phases. We initialize the process by training a random RL policy $\pi_{random}$ on each configuration $\lambda \in \bm{\lambda}$. The three configurations run until the first landscape point $t_{ls(1)}$, which forms the first landscape dataset $D_{ls(1)}$. The policies are snapshotted at this point, and the policy for the next phase is selected based on the final performance, indicated by the continuation of the blue points. The selected configuration is shown with orange circles. This process is repeated for two more phases to create landscape datasets $D_{ls(2)}$ and $D_{ls(3)}$. The end of the final phase $t_{ls(3)}$ corresponds to the final training point $t_{final}$.
  • Figure 3: IGPR plots of the mean surfaces for learning rate and discount factor for DQN across three phases of the RL training process. Local minima are marked by an inverted triangle and local maxima by an upright triangle. The configuration selected for the next stage is marked by a star.
  • Figure 4: IGPR plots for learning rate and discount factor for SAC across phases.
  • Figure 5: IGPR plots of the mean surfaces for learning rate and discount factor for PPO across three phases of the RL training process.
  • ...and 5 more figures
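
The phased data collection described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: `Config`, `toy_train_phase`, and `collect_landscapes` are hypothetical names, the "training" is a toy scoring function whose best learning rate drifts from phase to phase, and selection is assumed to mean continuing from the best-performing configuration's snapshot. It only shows the control flow: every configuration starts each phase from the same snapshot, the results of the phase form one landscape dataset, and one result is selected as the starting point for the next phase.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical hyperparameter configuration (one lambda in Figure 2)."""
    learning_rate: float
    discount: float


def toy_train_phase(start_return, config, phase, rng):
    """Toy stand-in for training one phase from a snapshot (an assumption).

    The best learning rate drifts with the phase so that the landscape
    changes over time; a real pipeline would run an RL algorithm here.
    """
    target_lr = 10.0 ** (-2 - phase)
    lr_penalty = abs(math.log10(config.learning_rate) - math.log10(target_lr))
    gain = 10.0 * config.discount - 3.0 * lr_penalty
    return start_return + gain + rng.gauss(0.0, 0.1)


def collect_landscapes(configs, num_phases, seed=0):
    """Return one dataset per phase and the configuration selected per phase."""
    rng = random.Random(seed)
    snapshot_return = 0.0  # stands in for the snapshot of the random policy
    datasets, selected = [], []
    for phase in range(num_phases):
        # Every configuration continues from the same snapshot in this phase.
        data = [
            (config, toy_train_phase(snapshot_return, config, phase, rng))
            for config in configs
        ]
        datasets.append(data)  # landscape dataset D_ls(phase + 1)
        # Assumed selection rule: keep the best final performance.
        best_config, best_return = max(data, key=lambda item: item[1])
        selected.append(best_config)
        snapshot_return = best_return
    return datasets, selected


configs = [Config(lr, 0.99) for lr in (1e-2, 1e-3, 1e-4)]
datasets, selected = collect_landscapes(configs, num_phases=3)
for phase, config in enumerate(selected, start=1):
    print(f"phase {phase}: selected learning rate {config.learning_rate:g}")
```

In this toy setting the selected learning rate changes in every phase, which mirrors the kind of temporal variation the paper reports, although the numbers here are purely illustrative and carry no empirical meaning.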