Adaptable Hindsight Experience Replay for Search-Based Learning

Alexandros Vazaios, Jannis Brugger, Cedric Derstroff, Kristian Kersting, Mira Mezini

TL;DR

Sparse reward signals hinder training in neural-guided MCTS. The authors propose Adaptable Hindsight Experience Replay (AHER), a unifying framework that parameterizes HER across four properties and integrates it with AlphaZero-like search. Across bit-flipping, point maze, and equation discovery, AHER demonstrates that customizing HER configurations yields improvements over pure reinforcement learning or supervised learning, with task-dependent optimal settings. The work highlights both the potential and limitations of HER in neural-guided search and points to future directions such as probabilistic transitions and curriculum-based relabeling.

Abstract

AlphaZero-like Monte Carlo Tree Search systems, originally introduced for two-player games, dynamically balance exploration and exploitation using neural network guidance. This combination also makes them suitable for classical search problems. However, the original method of training the network with simulation results is limited in sparse reward settings, especially in the early stages, where the network cannot yet give guidance. Hindsight Experience Replay (HER) addresses this issue by relabeling unsuccessful trajectories from the search tree as supervised learning signals. We introduce Adaptable HER (AHER), a flexible framework that integrates HER with AlphaZero, allowing easy adjustments to HER properties such as relabeled goals, policy targets, and trajectory selection. Our experiments, including equation discovery, show that the possibility of modifying HER is beneficial and surpasses the performance of pure supervised or reinforcement learning.

Paper Structure

This paper contains 14 sections, 2 figures.

Figures (2)

  • Figure 1: Architectural overview of our learning setup. AHER is connected to the AlphaZero training loop. An iteration of the loop consists of performing neural-guided MCTS, sampling batches from the experience replay buffer, and training the predictor neural network.
  • Figure 2: Performance of AHER-augmented AlphaZero. (left) Visited states in the search tree until the correct equation is found. AlphaZero refers to training only with the probabilities from the MCTS, AHER($k$) for adding HER samples from $k$ trajectories, and SL (d.c.) for supervised learning with dataset changes. Dataset changes entail sampling new $x$ and fitting $y$ measurements after each training step to increase training diversity. We chose to also perform this during AHER goal relabeling. The HER samples are added with the "final" goal selection strategy and one-hot policy targets. Failed searches count as 1,000 visited states. (middle and right) Success rate for solving bit-flipping with 50 bits (middle) and point maze (right), depending on the number of HER samples. The HER samples are added with the "future" goal selection strategy from the played trajectory and original MCTS probabilities. We display the mean and 95% confidence intervals of 5 runs.
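
The "final" and "future" goal selection strategies named in the Figure 2 caption can be illustrated with a minimal sketch on a toy bit-flipping trajectory. This is not the authors' implementation. The names `Step`, `Sample`, `rollout`, and `relabel` are hypothetical. The value target of 1.0 for relabeled samples is an assumption (the relabeled goal is reached in hindsight), and only one-hot policy targets are shown, not the original MCTS probabilities.

```python
import random
from typing import List, NamedTuple, Optional, Tuple

State = Tuple[int, ...]


class Step(NamedTuple):
    state: State       # state before acting
    action: int        # index of the bit that was flipped
    next_state: State  # state reached after acting


class Sample(NamedTuple):
    state: State
    goal: State                 # relabeled (hindsight) goal
    policy_target: List[float]  # one-hot over actions
    value_target: float         # assumed 1.0: the goal is reached in hindsight


def flip(state: State, i: int) -> State:
    """Return a copy of `state` with bit `i` flipped."""
    return state[:i] + (1 - state[i],) + state[i + 1:]


def rollout(start: State, actions: List[int]) -> List[Step]:
    """Apply `actions` in order and record every transition."""
    steps, s = [], start
    for a in actions:
        nxt = flip(s, a)
        steps.append(Step(s, a, nxt))
        s = nxt
    return steps


def one_hot(action: int, n: int) -> List[float]:
    return [1.0 if i == action else 0.0 for i in range(n)]


def relabel(trajectory: List[Step], strategy: str = "final", k: int = 1,
            rng: Optional[random.Random] = None) -> List[Sample]:
    """Turn a (possibly failed) trajectory into supervised samples.

    "final":  every step is relabeled with the last state reached.
    "future": every step gets `k` goals drawn from states reached
              at or after that step in the same trajectory.
    """
    rng = rng or random.Random(0)
    n_actions = len(trajectory[0].state)
    samples = []
    for t, step in enumerate(trajectory):
        if strategy == "final":
            goals = [trajectory[-1].next_state]
        elif strategy == "future":
            future = [s.next_state for s in trajectory[t:]]
            goals = [rng.choice(future) for _ in range(k)]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        for g in goals:
            samples.append(
                Sample(step.state, g, one_hot(step.action, n_actions), 1.0))
    return samples


# Example: a 4-bit episode that flips bits 0, 2, 1 and misses its real goal.
trajectory = rollout((0, 0, 0, 0), [0, 2, 1])
final_samples = relabel(trajectory, strategy="final")
future_samples = relabel(trajectory, strategy="future", k=2)
```

With the "final" strategy each step yields one sample whose goal is the last state of the episode. With "future" the number of samples per step is controlled by `k`, which here is only a stand-in for the sample counts varied in the figure.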