DHP: Discrete Hierarchical Planning for Hierarchical Reinforcement Learning Agents
Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
TL;DR
DHP tackles long-horizon planning in hierarchical RL by replacing continuous distance metrics with discrete reachability checks to judge subgoal feasibility. It builds recursive subtask trees via a planning policy operating in a learned latent space (GCSR) and optimizes that policy with a novel tree-based return that uses a min operation to encourage solving bottlenecks, complemented by a memory-augmented explorer for efficient data collection. The online variant demonstrates state-of-the-art performance on a 25-room navigation task and substantial gains on OGBench humanoid mazes, while an offline variant confirms architecture-agnostic applicability and scalability. Collectively, these contributions offer a robust, interpretable, and scalable framework for long-horizon planning that extends across online and offline settings.
Abstract
Hierarchical Reinforcement Learning (HRL) agents often struggle with long-horizon visual planning due to their reliance on error-prone distance metrics. We propose Discrete Hierarchical Planning (DHP), a method that replaces continuous distance estimates with discrete reachability checks to evaluate subgoal feasibility. DHP recursively constructs tree-structured plans by decomposing long-term goals into sequences of simpler subtasks, using a novel advantage estimation strategy that inherently rewards shorter plans and generalizes beyond training depths. In addition, to address the data-efficiency challenge, we introduce an exploration strategy that generates targeted training examples for the planning modules without needing expert data. Experiments in 25-room navigation environments demonstrate a 100% success rate (vs. 90% for the baseline). We also present an offline variant that achieves state-of-the-art results on OGBench benchmarks, with up to 71% absolute gains on giant HumanoidMaze tasks, demonstrating that our core contributions are architecture-agnostic. The method also generalizes to momentum-based control tasks and requires only log N steps for replanning. Theoretical analysis and ablations validate our design choices.
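To make the recursive decomposition concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes states are integers on a line, a hypothetical `reachable` check that returns a yes/no answer instead of a distance, and a hypothetical `propose_subgoal` that stands in for the learned planning policy by picking the midpoint. All names and the threshold `k` are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Subtask = Tuple[int, int]  # (start state, goal state) of one leaf subtask


def reachable(a: int, b: int, k: int = 2) -> bool:
    """Toy discrete reachability check (assumption): yes/no, no distance value."""
    return abs(a - b) <= k


def propose_subgoal(start: int, goal: int) -> int:
    """Hypothetical stand-in for the learned planning policy: pick the midpoint."""
    return (start + goal) // 2


def plan(start: int, goal: int, max_depth: int) -> Optional[List[Subtask]]:
    """Recursively split (start, goal) until every leaf passes the reachability check.

    Returns the leaf subtasks in execution order, or None if the depth
    budget runs out before all leaves are reachable.
    """
    if reachable(start, goal):
        return [(start, goal)]  # leaf: directly achievable subtask
    if max_depth == 0:
        return None  # budget exhausted, plan is infeasible at this depth
    mid = propose_subgoal(start, goal)
    left = plan(start, mid, max_depth - 1)
    right = plan(mid, goal, max_depth - 1)
    if left is None or right is None:
        return None
    return left + right


if __name__ == "__main__":
    print(plan(0, 16, max_depth=3))  # 8 leaf subtasks from a depth-3 tree
    print(plan(0, 16, max_depth=2))  # None: depth budget too small
```

In this toy setting, an evenly split tree of depth 3 covers 8 leaf subtasks, which illustrates why tree depth can grow logarithmically with plan length. The actual method operates in a learned latent space with a trained planning policy and reachability estimator.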
