Combining Automated Optimisation of Hyperparameters and Reward Shape

Julian Dierkes, Emma Cramer, Holger H. Hoos, Sebastian Trimpe

TL;DR

Combined optimisation significantly improves over baseline performance in half of the environments and achieves competitive performance in the others, with only a minor increase in computational costs, suggesting that combined optimisation should be best practice.

Abstract

There has been significant progress in deep reinforcement learning (RL) in recent years. Nevertheless, finding suitable hyperparameter configurations and reward functions remains challenging even for experts, and performance heavily relies on these design choices. Also, most RL research is conducted on known benchmarks where knowledge about these choices already exists. However, novel practical applications often pose complex tasks for which no prior knowledge about good hyperparameters and reward functions is available, thus necessitating their derivation from scratch. Prior work has examined automatically tuning either hyperparameters or reward functions individually. We demonstrate empirically that an RL algorithm's hyperparameter configurations and reward function are often mutually dependent, meaning neither can be fully optimised without appropriate values for the other. We then propose a methodology for the combined optimisation of hyperparameters and the reward function. Furthermore, we include a variance penalty as an optimisation objective to improve the stability of learned policies. We conducted extensive experiments using Proximal Policy Optimisation and Soft Actor-Critic on four environments. Our results show that combined optimisation significantly improves over baseline performance in half of the environments and achieves competitive performance in the others, with only a minor increase in computational costs. This suggests that combined optimisation should be best practice.
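
To make the two-level optimisation concrete, the sketch below shows one way the outer loop (joint sampling of hyperparameters and reward-shape weights) can wrap the inner RL training loop. It assumes Optuna as a stand-in for the paper's optimiser, Stable-Baselines3 PPO, and Gymnasium LunarLander; the ShapedReward wrapper, the parameter ranges, the training budget, and the penalty weight are illustrative assumptions, not the paper's actual search space or objective.

    import gymnasium as gym
    import numpy as np
    import optuna
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy


    class ShapedReward(gym.RewardWrapper):
        """Hypothetical reward shaping: scales the native reward and adds a
        per-step time penalty. The paper's reward-shape parameters are
        environment-specific; this wrapper only illustrates the mechanism."""

        def __init__(self, env, reward_scale, step_penalty):
            super().__init__(env)
            self.reward_scale = reward_scale
            self.step_penalty = step_penalty

        def reward(self, reward):
            return self.reward_scale * reward - self.step_penalty


    def objective(trial):
        # Outer loop: jointly sample hyperparameters and reward-shape weights.
        lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
        gamma = trial.suggest_float("gamma", 0.9, 0.9999)
        reward_scale = trial.suggest_float("reward_scale", 0.1, 10.0, log=True)
        step_penalty = trial.suggest_float("step_penalty", 0.0, 1.0)

        # Inner loop: one full RL training with the sampled configuration.
        # (Environment id may differ depending on the Gymnasium version installed.)
        env = ShapedReward(gym.make("LunarLander-v3"), reward_scale, step_penalty)
        model = PPO("MlpPolicy", env, learning_rate=lr, gamma=gamma, verbose=0)
        model.learn(total_timesteps=100_000)

        # Evaluate on the unshaped task reward so trials remain comparable.
        eval_env = gym.make("LunarLander-v3")
        returns, _ = evaluate_policy(model, eval_env, n_eval_episodes=20,
                                     return_episode_rewards=True)
        returns = np.asarray(returns)

        # Variance-penalised objective: trade mean return against stability
        # (the penalty weight 0.5 is an arbitrary illustrative choice).
        return returns.mean() - 0.5 * returns.std()


    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)

Evaluating on the unshaped task reward keeps trials with different shaping weights comparable, which is why the training signal is separated from the optimisation objective in this sketch.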

Paper Structure

This paper contains 28 sections, 7 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Illustration of the two-level optimisation process. Outer loop: hyper- and reward parameter optimisation; inner loop: RL training. In each iteration, the parameter optimiser chooses parameters and receives their performance measured by $\mathcal{O}_{\text{goal}}(\pi)$.
  • Figure 2: Landscapes depicting the average return on LunarLander for pairwise hyperparameters and reward weights over ten PPO trainings. Lower values (lighter) correspond to faster landing (better performance). Yellow lines mark each parameter's default value. Blue lines denote the best-performing reward weights for each hyperparameter value. The black dots mark the incumbent configurations found in the joint optimisation experiments.
  • Figure 3: Incumbent performance in terms of median optimisation objective across the five optimisation runs for the SAC experiments at each time step; shaded areas indicate min and max values. The performance drop in the multi-objective experiments is due to the weighted penalty term (a plausible form of this penalised objective is sketched after the figure list).
  • Figure 4: Illustrations from left to right of the environments Gymnasium LunarLander, Google Brax Ant and Humanoid, and Robosuite Wipe.
  • Figure 5: Landscapes depicting the average return on Gymnasium LunarLander for pairwise hyper- and reward parameters over ten PPO trainings. Lower values (lighter) correspond to faster landing time and, thus, better performance. The yellow lines mark the default values for each parameter. The blue line denotes the best-performing hyperparameter value for each specific reward value. The black dots mark the incumbent configurations found in the joint optimisation experiments.
  • ...and 5 more figures
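
Figures 1 and 3 refer to the optimisation objective $\mathcal{O}_{\text{goal}}(\pi)$ and to a weighted penalty term. A plausible form of this variance-penalised objective, stated here as an assumption rather than the paper's exact formulation, is

    $\mathcal{O}_{\text{goal}}(\pi) = \mathbb{E}\left[R(\pi)\right] - \lambda \, \sigma\left[R(\pi)\right],$

where $R(\pi)$ is the return of policy $\pi$ over evaluation episodes, $\sigma$ its standard deviation across those episodes, and $\lambda \ge 0$ weights stability against mean performance; $\lambda = 0$ recovers the single-objective case.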

Theorems & Definitions (1)

  • Definition