Hierarchical Reinforcement Learning with Uncertainty-Guided Diffusional Subgoals

Vivienne Huiling Wang, Tinghuai Wang, Joni Pajarinen

TL;DR

This paper addresses instability and non-stationarity in off-policy hierarchical RL by introducing HIDI, which generates subgoals with a state-conditioned diffusion model and regularizes that model with a Gaussian Process (GP) prior to quantify uncertainty. A hybrid subgoal selection strategy mixes diffusion-generated subgoals with the GP predictive mean, enabling robust and diverse planning across hierarchy levels. The approach uses a sparse GP for scalability and shows improved sample efficiency and performance on a suite of long-horizon MuJoCo tasks, with ablations isolating the contributions of diffusion, GP regularization, and subgoal selection. Overall, HIDI makes subgoal generation in HRL uncertainty-aware, which stabilizes learning on complex continuous-control problems.

Abstract

Hierarchical reinforcement learning (HRL) learns to make decisions on multiple levels of temporal abstraction. A key challenge in HRL is that the low-level policy changes over time, making it difficult for the high-level policy to generate effective subgoals. To address this issue, the high-level policy must capture a complex subgoal distribution while also accounting for uncertainty in its estimates. We propose an approach that trains a conditional diffusion model regularized by a Gaussian Process (GP) prior to generate a complex variety of subgoals while leveraging principled GP uncertainty quantification. Building on this framework, we develop a strategy that selects subgoals from both the diffusion policy and GP's predictive mean. Our approach outperforms prior HRL methods in both sample efficiency and performance on challenging continuous control benchmarks.
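The selection strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Bernoulli mixing rule with probability `p_gp`, the RBF kernel, the exact (non-sparse) GP, and the `diffusion_sample` callable are all assumptions made for illustration, and every name below is hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_predict(train_s, train_g, query_s, noise=1e-2):
    """Exact GP regression from states to subgoals (assumption: the paper
    uses a sparse GP; an exact GP is used here only for brevity).

    Returns the predictive mean (m, d_g) and predictive variance (m,).
    """
    K = rbf_kernel(train_s, train_s) + noise * np.eye(len(train_s))
    K_star = rbf_kernel(query_s, train_s)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, train_g))
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = rbf_kernel(query_s, query_s).diagonal() - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)


def select_subgoal(state, diffusion_sample, train_s, train_g, p_gp, rng):
    """Hypothetical mixing rule: with probability p_gp return the GP
    predictive mean, otherwise return a sample from the diffusion policy."""
    if rng.random() < p_gp:
        mean, _ = gp_predict(train_s, train_g, state[None, :])
        return mean[0], "gp_mean"
    return diffusion_sample(state, rng), "diffusion"


# Toy data: 2D states paired with 2D subgoals offset by a fixed step.
train_s = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
train_g = train_s + 0.5


def toy_diffusion_sample(state, rng):
    """Stand-in for the reverse diffusion chain: a noisy offset of the state."""
    return state + 0.5 + 0.1 * rng.standard_normal(state.shape)


rng = np.random.default_rng(0)
subgoal, source = select_subgoal(
    np.array([0.2, 0.1]), toy_diffusion_sample, train_s, train_g, p_gp=0.5, rng=rng
)
```

Under these assumptions, the predictive variance returned by `gp_predict` is small near previously visited states and approaches the prior variance far from them, which is the kind of uncertainty signal the GP prior contributes.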

Paper Structure

This paper contains 29 sections, 6 theorems, 60 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $\mathbf{g} = f(\boldsymbol{\epsilon}', \mathbf{s}; \theta_h)$ be the subgoal generated by the diffusion model conditioned on state $\mathbf{s}$ and noise $\boldsymbol{\epsilon}'$. Under mild regularity assumptions, the GP regularization term in the high-level objective encourages the learned di

Figures (5)

  • Figure 1: Learning curves of our method and baselines, i.e., HLPS (Wang et al., 2024), SAGA (Wang et al., 2023), HIGL (Kim et al., 2021), HRAC (Zhang et al., 2020), and HIRO (Nachum et al., 2018). Each curve and its shaded region represent the average success rate and 95% confidence interval, respectively, averaged over 10 independent trials.
  • Figure 2: (a-b) Learning curves of ablated variants: HIDI-A refers to HIDI without subgoal selection, and HIDI-B refers to HIDI without subgoal selection and GP priors. (c) HIDI performance with varying diffusion steps. (d) HIDI performance with varying probabilities for subgoal selection.
  • Figure 3: Environments used in our experiments.
  • Figure 4: (Left) Impact of $\eta$, which balances the diffusion objective and RL objective. (Middle) Impact of $\psi$, which adjusts the influence of GP prior in learning the distribution of diffusional subgoals. (Right) Visualization of the learned inducing states (2D coordinates) compared with the complete training data.
  • Figure 5: Visualization of generated subgoals and reached subgoals of HIDI and compared baselines in Ant Maze (W-shape, sparse) with the same starting location.

Theorems & Definitions (11)

  • Theorem 3.1: GP Regularization Guides Subgoal Alignment
  • Proposition 3.2: Gradient Weighting by GP Uncertainty
  • Theorem 3.3: Single-Step Regret Bound for the Subgoal Selection
  • Proposition 3.4: Single-Step Policy Improvement
  • Theorem 1.1: Validity of the Learned Subgoal Distribution
  • proof
  • Theorem 1.2: Guiding Effect of GP Regularization
  • proof
  • Remark 1.3: Connection to KL Divergence
  • proof: Detailed Proof of Theorem 3.3
  • ...and 1 more