DHP: Discrete Hierarchical Planning for Hierarchical Reinforcement Learning Agents

Shashank Sharma, Janina Hoffmann, Vinay Namboodiri

TL;DR

DHP tackles long-horizon planning in hierarchical RL by replacing continuous distance metrics with discrete reachability checks to judge subgoal feasibility. A planning policy operating in a learned latent space (GCSR) builds recursive subtask trees and is optimized with a novel tree-based return that takes the minimum over child nodes, so a plan is rewarded only when every branch, including its bottleneck, is solved. A memory-augmented explorer collects training data efficiently without expert demonstrations. The online variant reaches a 100% success rate on a 25-room navigation task (vs. 90% for the baseline), and an offline variant achieves substantial gains on OGBench HumanoidMaze tasks, indicating that the core contributions are architecture-agnostic. Together, these contributions offer an interpretable and scalable framework for long-horizon planning in both online and offline settings.

Abstract

Hierarchical Reinforcement Learning (HRL) agents often struggle with long-horizon visual planning due to their reliance on error-prone distance metrics. We propose Discrete Hierarchical Planning (DHP), a method that replaces continuous distance estimates with discrete reachability checks to evaluate subgoal feasibility. DHP recursively constructs tree-structured plans by decomposing long-term goals into sequences of simpler subtasks, using a novel advantage estimation strategy that inherently rewards shorter plans and generalizes beyond training depths. In addition, to address the data efficiency challenge, we introduce an exploration strategy that generates targeted training examples for the planning modules without needing expert data. Experiments in 25-room navigation environments demonstrate a 100% success rate (vs. 90% baseline). We also present an offline variant that achieves state-of-the-art results on OGBench benchmarks, with up to 71% absolute gains on giant HumanoidMaze tasks, demonstrating our core contributions are architecture-agnostic. The method also generalizes to momentum-based control tasks and requires only log N steps for replanning. Theoretical analysis and ablations validate our design choices.
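The recursive decomposition and the "log N steps for replanning" claim can be illustrated with a toy sketch. This is not the authors' implementation: states are integers on a 1-D chain, an exact midpoint stands in for the learned planner and GCSR decoder, and a hypothetical threshold `k` stands in for the learned reachability check. The function name `unroll_first_branch` and all parameters are assumptions made for illustration.

```python
from typing import List


def unroll_first_branch(start: int, goal: int, k: int, max_depth: int = 32) -> List[int]:
    """Unroll only the first branch of a subtask tree on a toy 1-D chain.

    Assumptions (not from the paper): states are integers, the subgoal is
    the exact midpoint, and a state counts as reachable when it lies
    within k steps of the start.
    """
    stack = [goal]  # subgoal stack; the last entry is handed to the worker
    # Keep splitting the first subtask (start, stack[-1]) until its end
    # point is reachable or the depth cap is hit.
    while abs(stack[-1] - start) > k and len(stack) <= max_depth:
        midpoint = (start + stack[-1]) // 2
        stack.append(midpoint)
    return stack


if __name__ == "__main__":
    plan = unroll_first_branch(start=0, goal=64, k=4)
    print("subgoal stack:", plan)
    print("number of splits:", len(plan) - 1)
```

In this toy setting each split halves the remaining gap, so the number of splits grows with the logarithm of the task length rather than with the length itself.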

Paper Structure

This paper contains 50 sections, 4 theorems, 42 equations, 14 figures, 5 tables, 6 algorithms.

Key Result

Theorem A.1

Given a tree trajectory $\tau$, specified as a list of nodes $n_i$ and generated using a policy $\pi_P$, the policy gradients can be written as:

Figures (14)

  • Figure 1: Illustrations for different module architectures. (a) Overall planning agent architecture. The world model predicts the state $s_t$, the planner takes the current and goal states $(s_t,s_g)$ to output a latent variable $z$, the GCSR Decoder is then used to predict a subgoal $s_i$. Then the subgoal is used as a goal to predict another subgoal. This continues recursively till a reachable subgoal $s_{wg}$ is found, which is then passed to the worker. (b) The GCSR module is a conditional VAE that consists of an encoder and a decoder optimized to predict midway states, given the initial and final states. (c) The planning policy uses the GCSR decoder to predict subgoals.
  • Figure 2: The plan unrolling process during the training and inference phases. (a) During training, the initial task $(s_t,s_g)$ is recursively decomposed into smaller tasks by midway subgoal prediction to generate a subtask tree. The lowest-level nodes represent the finest decomposition of the initial task: $(s_t,s_g) \rightarrow (s_t,s_3,s_1,s_4,s_0,s_5,s_2,s_6,s_g)$. (b) During inference, only the first branch of the tree is unrolled. Here, the agent is tasked with reaching $s_g$, so it first divides the task $(s_t,s_g)$ into two chunks by inserting $s_0$. It then divides the subtask $(s_t,s_0)$ by inserting $s_1$, ignoring the second part $(s_0,s_g)$. The recursive division continues until the first subgoal reachable in $K$ steps is found. This results in the stack of subgoals shown at the bottom.
  • Figure 3: Example return estimations for an imperfect tree (node indices at top-left). The dash-bordered cells indicate terminal nodes, where the policy receives a reward of $1$ and the branch terminates. One branch terminates early $(i=4)$, while another $(i=11)$ does not terminate within the unrolled depth. (a) Since the return is computed as the $\min$ over the child nodes, the Monte-Carlo return at the root node is $0$ in this case. However, a positive learning signal is still induced at nodes $(i=1,3,6)$. (b) Using the critic as a baseline for computing $n$-step returns. The $n$-step returns allow bootstrapping by substituting the reward with the value estimate $v$ at the truncated node $(i=11)$. This induces a learning signal at the root node even if the plan is incomplete for the unrolled depth.
  • Figure 4: (a) Unconditional VAE that learns to predict states not conditioned on other states. (b) The explorer uses the memory and the VAE decoder to predict subgoals for the worker. (c) Illustration showing the states required as inputs for the Explorer for an example trajectory (top) being played out by an agent. It is a coarse trajectory that shows every $K$-th frame. The agent is at state $s_t$ and will receive rewards when it moves into the placeholder future state $s_{t+1}$ (dashed border). The rewards at $s_{t+1}$ will be computed using the GCSR for different temporal resolutions $q \in Q$ indicated on the right. The colored arrows indicate the state triplet required to compute the exploratory reward at $s_{t+1}$. Combining these state dependencies and removing redundancies yields the input requirements indicated below the dashed line. The inputs consist of the current state and the memory.
  • Figure 5: Full maze with a sample run using our agent.
  • ...and 9 more figures
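
The min-based return described in the Figure 3 caption can be sketched on a small binary tree. This is a minimal sketch under stated assumptions, not the paper's algorithm: nodes are heap-indexed (children of `i` are `2i+1` and `2i+2`), the set of terminal nodes is chosen by hand to mimic the caption, and the discount `gamma`, the `values` dictionary, and the function name `tree_return` are hypothetical.

```python
from typing import Dict, Optional, Set


def tree_return(
    i: int,
    terminal: Set[int],
    n_nodes: int,
    gamma: float = 1.0,
    values: Optional[Dict[int, float]] = None,
) -> float:
    """Return at node i of a heap-indexed binary subtask tree.

    Terminal (reachable) nodes yield 1. A non-terminal node at the
    unrolled depth yields 0 (Monte-Carlo) or a critic value when one is
    supplied (bootstrapped). Every other node takes the min of its
    children, scaled by an assumed discount gamma.
    """
    if i in terminal:
        return 1.0
    left, right = 2 * i + 1, 2 * i + 2
    if left >= n_nodes:  # truncated: no children were unrolled
        return 0.0 if values is None else values.get(i, 0.0)
    return gamma * min(
        tree_return(left, terminal, n_nodes, gamma, values),
        tree_return(right, terminal, n_nodes, gamma, values),
    )


# Hand-built tree of depth 3 (nodes 0..14): node 4 terminates early,
# node 11 is still unsolved at the unrolled depth.
TERMINAL = {4, 7, 8, 12, 13, 14}
N_NODES = 15

if __name__ == "__main__":
    for node in (0, 1, 3, 6):
        print(node, tree_return(node, TERMINAL, N_NODES))
    # Bootstrapping with a made-up critic estimate at the truncated node.
    print(tree_return(0, TERMINAL, N_NODES, values={11: 0.8}))
```

With these hand-picked inputs the Monte-Carlo return is zero at the root but positive at nodes 1, 3, and 6, and supplying a value estimate at the truncated node propagates a nonzero return to the root, which mirrors the behaviour the caption describes.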

Theorems & Definitions (8)

  • Theorem A.1: Policy Gradients
  • proof
  • Theorem A.2: Baselines
  • proof
  • Lemma A.3: Non-expansive Property of the Minimum Operator
  • proof
  • Theorem A.4: Contraction property of the return operators
  • proof