Beyond Single-Step Updates: Reinforcement Learning of Heuristics with Limited-Horizon Search

Gal Hadar, Forest Agostinelli, Shahaf S. Shperberg

TL;DR

The paper tackles learning heuristics for shortest-path problems by moving beyond single-step Bellman updates. It introduces Limited-Horizon Bellman-based Learning (LHBL), which uses limited-horizon search to generate training data and updates heuristics with a frontier-aware target $h_{LHB}$ that accounts for full paths to frontier leaves. The target computation is made efficient and cycle-aware via a single-source shortest-path formulation on a reversed augmented graph with an auxiliary node, enabling stable training of deep neural network heuristics. Empirical results on Rubik's Cube, the 35-Slide Tile Puzzle, and Lights Out show that LHBL generally improves sample efficiency, reduces depression regions, and yields faster convergence and better search performance than traditional single-step Bellman-based learning (SSBL), with the horizon length trading off lookahead depth against overfitting.

Abstract

Many sequential decision-making problems can be formulated as shortest-path problems, where the objective is to reach a goal state from a given starting state. Heuristic search is a standard approach for solving such problems, relying on a heuristic function to estimate the cost to the goal from any given state. Recent approaches leverage reinforcement learning to learn heuristics by applying deep approximate value iteration. These methods typically rely on single-step Bellman updates, where the heuristic of a state is updated based on its best neighbor and the corresponding edge cost. This work proposes a generalized approach that enhances both state sampling and heuristic updates by performing limited-horizon searches and updating each state's heuristic based on the shortest path to the search frontier, incorporating both edge costs and the heuristic values of frontier states.
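
To make the contrast between the two update rules concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary-based graph encoding, and the toy graph are all illustrative assumptions. It assumes the limited-horizon target of a state is the minimum, over frontier states, of the path cost to that frontier state plus its current heuristic value, which is how the abstract describes the update. The sketch computes all targets in one pass by reversing the search graph, attaching a hypothetical auxiliary node `z` to every frontier state with an edge weighted by that state's heuristic, and running Dijkstra's algorithm from `z`.

```python
import heapq
from collections import defaultdict
from math import inf


def single_step_target(state, edges, h):
    """Single-step Bellman target: best successor's heuristic plus edge cost."""
    return min(cost + h[succ] for succ, cost in edges[state])


def limited_horizon_targets(edges, frontier, h):
    """Targets for every state that can reach the frontier of a search graph.

    edges:    dict mapping state -> list of (successor, cost) pairs
    frontier: iterable of unexpanded leaf states
    h:        dict mapping state -> current heuristic estimate
    Returns a dict mapping state -> min over frontier f of dist(state, f) + h[f].
    """
    # Reverse every edge so a search from the frontier side reaches
    # all states in the graph in a single pass.
    reverse = defaultdict(list)
    for u, succs in edges.items():
        for v, cost in succs:
            reverse[v].append((u, cost))

    # Auxiliary node z (illustrative): one edge to each frontier state,
    # weighted by that state's heuristic value.
    z = object()
    for f in frontier:
        reverse[z].append((f, h[f]))

    # Standard Dijkstra from z; settled distances make cycles harmless.
    dist = {z: 0}
    heap = [(0, 0, z)]  # (distance, tie-breaker, node)
    counter = 1
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist.get(u, inf):
            continue  # stale queue entry
        for v, cost in reverse[u]:
            nd = d + cost
            if nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, counter, v))
                counter += 1

    del dist[z]
    return dist


# Toy search graph (an assumption for illustration) with a cycle s <-> a.
toy_edges = {"s": [("a", 1), ("b", 1)], "a": [("c", 1), ("s", 1)], "b": [("d", 2)]}
toy_h = {"s": 10, "a": 10, "b": 10, "c": 4, "d": 0}
toy_frontier = ["c", "d"]

print(single_step_target("s", toy_edges, toy_h))
print(limited_horizon_targets(toy_edges, toy_frontier, toy_h))
```

In this toy graph the single-step target for `s` depends only on the (here deliberately inflated) estimates of its neighbors `a` and `b`, whereas the limited-horizon target follows the full path `s -> b -> d` to the frontier and uses only the frontier's heuristic values. Dijkstra's algorithm assumes non-negative edge costs and heuristic values, which is an additional assumption of this sketch.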

Paper Structure

This paper contains 14 sections, 9 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Graph examples for comparing $h_{LHB}$ and $h_{SSB}$.
  • Figure 2: Illustration of graph transformation for computing $h_{LHB}$. The blue nodes are the frontier of the search graph; the red state $z$ is the auxiliary state.
  • Figure 3: Problems solved throughout the training: STP, LightsOut, and Rubik's Cube.
  • Figure 4: Results on fully trained heuristic: STP, LightsOut, and Rubik's Cube.
  • Figure 5: Example of STP representative state inside a depression region. Most tiles are in the correct place, but the state is still many steps away from being solved.
  • ...and 2 more figures