Generalized Population-Based Training for Hyperparameter Optimization in Reinforcement Learning

Hui Bai; Ran Cheng

TL;DR

The paper addresses dynamic hyperparameter optimization (HPO) in reinforcement learning by extending Population-Based Training (PBT) into Generalized Population-Based Training (GPBT) and pairing it with Pairwise Learning (PL). GPBT replaces direct elite-based replacement with asynchronous random pairing of agents, which promotes diversity. PL gives the lagging agent in each pair a pseudo-gradient, derived from the performance difference with its partner, to guide its hyperparameter update. The combination is called GPBT-PL. Empirical results on OpenAI Gym benchmarks with on-policy PPO and off-policy IMPALA show that GPBT-PL consistently outperforms standard PBT and its Bayesian-optimized variant, particularly on more complex tasks and under resource constraints. The work demonstrates improved adaptability and computational efficiency for HPO in RL and suggests using evolutionary ideas to handle high-dimensional hyperparameter spaces as a future direction.

Abstract

Hyperparameter optimization plays a key role in machine learning. Its significance is especially pronounced in reinforcement learning (RL), where agents continuously interact with and adapt to their environments, requiring dynamic adjustments in their learning trajectories. To accommodate this dynamic behavior, Population-Based Training (PBT) was introduced, leveraging the collective intelligence of a population of agents learning simultaneously. However, PBT tends to favor high-performing agents, potentially neglecting the explorative potential of agents on the brink of significant advancements. To mitigate these limitations, we present Generalized Population-Based Training (GPBT), a refined framework designed for enhanced granularity and flexibility in hyperparameter adaptation. Complementing GPBT, we further introduce Pairwise Learning (PL). Instead of focusing only on elite agents, PL employs a comprehensive pairwise strategy to identify performance differentials and provide holistic guidance to underperforming agents. By integrating the capabilities of GPBT and PL, our approach significantly improves upon traditional PBT in terms of adaptability and computational efficiency. Rigorous empirical evaluations across a range of RL benchmarks confirm that our approach consistently outperforms not only conventional PBT but also its Bayesian-optimized variant.
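
The pairing-and-update idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Agent` class, the `pairwise_update` function, the `velocity` field, and the specific update rule (a random fraction of the gap to the winner plus a random momentum term) are assumptions made for illustration, since this overview does not state the exact PL formula.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical population member: weights, hyperparameters, and a score."""
    weights: list
    hparams: dict
    score: float = 0.0
    velocity: dict = field(default_factory=dict)


def pairwise_update(ready, population, bounds, rng):
    """One asynchronous update for a single ready agent.

    Returns True if the ready agent lost the comparison and was updated.
    """
    # Random pairing: any other agent can be the partner, not only an elite one.
    partner = rng.choice([a for a in population if a is not ready])
    if ready.score >= partner.score:
        return False  # the ready agent keeps training unchanged

    # The lagging agent adopts the weights of its better-performing partner.
    ready.weights = list(partner.weights)

    # Assumed pseudo-gradient step (illustrative, not taken from the paper):
    # move each hyperparameter toward the winner's value by a random
    # fraction of the gap, plus a random share of the previous step.
    for name, (low, high) in bounds.items():
        diff = partner.hparams[name] - ready.hparams[name]
        prev = ready.velocity.get(name, 0.0)
        step = rng.random() * prev + rng.random() * diff
        ready.velocity[name] = step
        # Clamp the result to the allowed hyperparameter range.
        ready.hparams[name] = min(high, max(low, ready.hparams[name] + step))
    return True
```

In a full training loop, each agent would call `pairwise_update` whenever it reaches its own hyperparameter update interval, so agents never have to wait for the rest of the population to synchronize.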

Paper Structure (23 sections, 5 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Related Work
General HPO Methods
HPO in RL
Population-Based HPO Methods
Population-Based Training
Motivation
Proposed Approach
Problem Statement
Generalized Population-Based Training (GPBT)
Pairwise Learning (PL)
GPBT-PL
Experiments
Experimental Settings
Hyperparameter Settings
...and 8 more sections

Figures (9)

Figure 1: Framework of Generalized Population-Based Training (GPBT). A population of agents is initialized with random weights and hyperparameters and then trained in parallel. Upon reaching its designated hyperparameter update interval, each ready agent is asynchronously paired with a randomly chosen agent for an update. If the ready agent underperforms, it adopts the weights of its superior counterpart and updates its hyperparameters using specialized learning techniques. After the stopping criteria are met, the top-performing agents are identified.
Figure 2: Eight RL tasks selected from OpenAI Gym.
Figure 3: Training curves for six OpenAI Gym benchmarks using populations of 4 and 8 agents with GPBT-PL, PBT, and PB2. Thick lines represent the average of the best mean rewards over 7 seeds, with shaded regions denoting the standard deviation. Brackets specify the population size, and the perturbation interval is set to $5 \times 10^{4}$.
Figure 4: Training curves for Ant and HalfCheetah using populations of 4 and 8 agents with GPBT-PL, PBT, and PB2. (a)-(d) take timesteps as the x-axis and (e)-(h) take time (in hours) as the x-axis. Thick lines are the best-performing members of the population of each HPO method, with faint lines representing each member. Brackets specify the population size, and the perturbation interval is set to $5 \times 10^{4}$.
Figure 5: Training curves for 4-agent populations using GPBT-PL, PBT, and PB2 on two OpenAI Gym benchmarks. Thick lines denote average best rewards over 7 seeds, and shaded regions indicate standard deviation. The learning rate range is set to $[10^{-5}, 10^{-3}]$.
...and 4 more figures
