Flattening Hierarchies with Policy Bootstrapping

John L. Zhou, Jonathan C. Kao

TL;DR

The paper tackles long-horizon offline GCRL by removing the complexity of hierarchical subgoal generators and deriving SAW, a flat policy objective that bootstraps from subgoal-conditioned action distributions, using advantage-based importance weights over subgoals taken from dataset trajectories. By casting HRL as probabilistic inference and then eliminating the subgoal generator, SAW directly learns a unified policy that benefits from near-goal subpolicy training while avoiding high-dimensional generative modeling. Empirically, SAW matches or exceeds state-of-the-art offline GCRL baselines across 20 state- and pixel-based locomotion and manipulation tasks, including challenging long-horizon scenarios, and is robust to high-dimensional observations when paired with appropriate representation learning. The approach offers a simpler, more scalable path toward robotic foundation policies by internalizing the strengths of hierarchy within a single, data-driven bootstrapping framework. Its limitations relate to subgoal sampling bias and representation scalability.

Abstract

Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
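The abstract's claim that sparse rewards and discounting obscure the advantage of primitive actions with respect to distant goals can be illustrated with a small calculation. The sketch below is not taken from the paper. It assumes a reward of 1 on reaching the goal and 0 otherwise, so that the optimal value of a state `d` steps from the goal is `gamma ** d`. The function names and the values `gamma = 0.99`, 10 steps, and 500 steps are hypothetical choices for illustration.

```python
def sparse_goal_value(steps_to_goal: int, gamma: float) -> float:
    """Optimal value under the ASSUMED reward: 1 at the goal, 0 elsewhere."""
    return gamma ** steps_to_goal


def one_step_gap(steps_to_goal: int, gamma: float) -> float:
    """Value gained by being one step closer (d - 1 instead of d steps away)."""
    closer = sparse_goal_value(steps_to_goal - 1, gamma)
    current = sparse_goal_value(steps_to_goal, gamma)
    return closer - current


gamma = 0.99                        # hypothetical discount factor
near_gap = one_step_gap(10, gamma)  # goal is 10 steps away
far_gap = one_step_gap(500, gamma)  # goal is 500 steps away

print(f"one-step value gap, goal 10 steps away:  {near_gap:.3e}")
print(f"one-step value gap, goal 500 steps away: {far_gap:.3e}")
print(f"ratio near/far: {near_gap / far_gap:.1f}")
```

Under these assumptions the gap shrinks geometrically with distance, so the signal separating a good primitive action from a bad one is easily swamped by value-estimation error when the goal is far away. This is the motivation for learning from nearer subgoals, where the gap is larger.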

Paper Structure

This paper contains 38 sections, 43 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Learning with subgoals. Both HIQL and RIS "imagine" subgoals (thought bubbles) en route to the goal (red star) with generative models. However, HIQL samples actions directly from the subgoal-conditioned policy, while RIS regresses (black arrow) a flat goal-conditioned policy towards the subgoal-conditioned action distribution during training. SAW also performs regression but only uses "real" subgoals from the dataset $\mathcal{D}$, weighting the regression more heavily towards distributions conditioned on good subgoals and less (gray arrow) towards bad ones.
  • Figure 2: OGBench tasks. We train SAW on 20 datasets collected from 7 different environments (pictured above) and perform evaluations across 5 state-goal pairs for each dataset.
  • Figure 3: Subgoal representations scale poorly to high-dimensional control in large state spaces. Using HIQL's subgoal representations (taken from an intermediate layer of the value function) for SAW's target subpolicy harms performance compared to training directly on observations. However, HIQL fails to learn meaningful behaviors when predicting subgoals directly in the raw observation space. RIS, which bootstraps on generated subgoals at every step, performs the worst of the three.
  • Figure 4: Training curves for scene-play, antmaze-large-navigate, and humanoid-giant-navigate on the left, and the mean one-step advantage over dataset actions with respect to subgoals on the right.
  • Figure 5: Training curves for cube-single-play and cube-double-play with one-step ablations.
  • ...and 3 more figures
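The caption of Figure 1 describes regressing a flat goal-conditioned policy towards subgoal-conditioned action distributions, with regressions on good dataset subgoals weighted more heavily than regressions on bad ones. The sketch below shows one way such a weighted regression could look for a discrete action space. It is not the paper's objective: the exponentiated, clipped advantage weights are borrowed from the general form of advantage-weighted regression, and the KL direction, the temperature `beta`, the clip `max_weight`, all function names, and the toy numbers are assumptions made for illustration.

```python
import math


def advantage_weights(advantages, beta=3.0, max_weight=100.0):
    """Exponentiated, clipped advantages, normalised to sum to 1 (assumed form)."""
    raw = [min(math.exp(beta * a), max_weight) for a in advantages]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_kl_loss(flat_probs, sub_probs_per_subgoal, weights):
    """Sum over subgoals of weight * KL(subgoal-conditioned || flat policy)."""
    loss = 0.0
    for weight, sub_probs in zip(weights, sub_probs_per_subgoal):
        kl = sum(p * math.log(p / q)
                 for p, q in zip(sub_probs, flat_probs) if p > 0)
        loss += weight * kl
    return loss


def regression_target(sub_probs_per_subgoal, weights):
    """Minimiser of the weighted KL above: the weight-averaged mixture."""
    n_actions = len(sub_probs_per_subgoal[0])
    return [sum(w * probs[i] for w, probs in zip(weights, sub_probs_per_subgoal))
            for i in range(n_actions)]


# Toy data: two dataset subgoals for one (state, goal) pair, three actions.
sub_probs = [
    [0.8, 0.1, 0.1],  # action distribution conditioned on a "good" subgoal
    [0.1, 0.8, 0.1],  # action distribution conditioned on a "bad" subgoal
]
advantages = [0.5, -0.5]  # hypothetical advantages of reaching each subgoal

weights = advantage_weights(advantages)
target = regression_target(sub_probs, weights)
uniform = [1.0 / 3.0] * 3

print("weights:", [round(w, 3) for w in weights])
print("target: ", [round(p, 3) for p in target])
print("loss at target: ", round(weighted_kl_loss(target, sub_probs, weights), 4))
print("loss at uniform:", round(weighted_kl_loss(uniform, sub_probs, weights), 4))
```

In this toy setting the regression target concentrates on the action favoured by the higher-advantage subgoal, and no generative model over subgoals is needed because both subgoals come from the data. A practical implementation would replace the closed-form mixture with gradient steps on a parameterised policy.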