Beyond Single-Step Updates: Reinforcement Learning of Heuristics with Limited-Horizon Search
Gal Hadar, Forest Agostinelli, Shahaf S. Shperberg
TL;DR
The paper tackles learning heuristics for shortest-path problems by moving beyond single-step Bellman updates. It introduces Limited-Horizon Bellman-based Learning (LHBL), which uses limited-horizon search to generate training data and updates heuristics with a frontier-aware target $h_{LHB}$ that accounts for full paths to frontier leaves. The target computation is made efficient and cycle-aware via a single-source shortest-path formulation on a reversed augmented graph with an auxiliary node, enabling stable training of deep neural network heuristics. Empirical results across Rubik's Cube, 35-Slide Tile Puzzle, and Lights Out show that LHBL generally improves sample efficiency, reduces depression regions, and yields faster convergence and better search performance than traditional SSBL, with horizon length presenting a trade-off between lookahead depth and overfitting.
Abstract
Many sequential decision-making problems can be formulated as shortest-path problems, where the objective is to reach a goal state from a given starting state. Heuristic search is a standard approach for solving such problems, relying on a heuristic function to estimate the cost to the goal from any given state. Recent approaches leverage reinforcement learning to learn heuristics by applying deep approximate value iteration. These methods typically rely on single-step Bellman updates, where the heuristic of a state is updated based on its best neighbor and the corresponding edge cost. This work proposes a generalized approach that enhances both state sampling and heuristic updates by performing limited-horizon searches and updating each state's heuristic based on the shortest path to the search frontier, incorporating both edge costs and the heuristic values of frontier states.
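As an illustration only, the frontier-aware update can be read as assigning each explored state $s$ the target $\min_{f} \left[ d(s, f) + h(f) \right]$, where $f$ ranges over frontier states and $d(s, f)$ is the shortest-path cost inside the explored search graph. The sketch below computes all such targets with one Dijkstra run from an auxiliary node on the reversed graph. The function name `lhb_targets`, the edge-list input format, and the assumption that edge costs and heuristic values are non-negative (required by Dijkstra) are assumptions made for this example, not details taken from the paper.

```python
import heapq
from collections import defaultdict


def lhb_targets(edges, frontier_h):
    """Hypothetical helper: frontier-aware targets for every explored state.

    edges      : iterable of (u, v, cost) directed edges found by the search
    frontier_h : dict mapping each frontier state f to its heuristic h(f)
    Returns a dict state -> min over f of (shortest cost u -> f) + h(f).
    Assumes non-negative costs and heuristic values.
    """
    aux = object()  # auxiliary source node, distinct from every real state

    # Reverse every edge so that distances from `aux` equal costs *to* the frontier.
    rev = defaultdict(list)
    for u, v, cost in edges:
        rev[v].append((u, cost))
    # Connect the auxiliary node to each frontier state with weight h(f).
    for f, h in frontier_h.items():
        rev[aux].append((f, h))

    dist = {aux: 0.0}
    heap = [(0.0, 0, aux)]  # (distance, tie-breaker, node)
    counter = 1
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, w in rev[node]:
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, counter, nxt))
                counter += 1

    del dist[aux]
    return dist


# Toy search graph with a cycle between "s" and "a" and two frontier leaves.
edges = [("s", "a", 1), ("a", "s", 1), ("s", "b", 1), ("a", "f1", 1), ("b", "f2", 1)]
targets = lhb_targets(edges, {"f1": 5, "f2": 2})
print(targets["s"], targets["a"], targets["b"])  # 4.0 5.0 3.0
```

In this toy graph, a single-step update for `a` that looks only at its successor `f1` would give 1 + 5 = 6, whereas the frontier-aware target finds the cheaper route through `s` and `b` to `f2` and gives 5. The cycle between `s` and `a` causes no trouble because each node's distance is finalized once.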
