Flattening Hierarchies with Policy Bootstrapping
John L. Zhou, Jonathan C. Kao
TL;DR
The paper tackles long-horizon offline goal-conditioned RL (GCRL) by removing the hierarchical subgoal generator and deriving SAW, a flat policy objective that bootstraps from subgoal-conditioned actions using advantage-weighted importance sampling over dataset trajectories. By casting hierarchical RL (HRL) as probabilistic inference and then eliminating the subgoal generator, SAW directly learns a single unified policy that retains the benefit of training subpolicies on nearby goals while avoiding generative modeling in a high-dimensional goal space. Empirically, SAW matches or exceeds state-of-the-art offline GCRL baselines across 20 state- and pixel-based locomotion and manipulation tasks, including challenging long-horizon scenarios, and remains robust to high-dimensional observations when paired with appropriate representation learning. The approach offers a simpler, more scalable path toward robotic foundation policies by folding the strengths of hierarchy into a single data-driven bootstrapping objective; its limitations relate to subgoal sampling bias and representation scalability.
Abstract
Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
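The abstract's point about sparse rewards and discounting can be illustrated with a small worked example. The sketch below is not taken from the paper: it assumes a reward of 1 on reaching the goal and 0 otherwise, a discount factor of 0.99, and a hypothetical "detour" action that costs two extra steps. Under those assumptions, the value gap between the better and the worse action is gamma^T (1 - gamma^2), which shrinks geometrically as the goal distance T grows.

```python
def discounted_goal_value(steps_to_goal: int, gamma: float = 0.99) -> float:
    """Value of reaching the goal in `steps_to_goal` steps.

    Assumes a sparse reward of 1 at the goal and 0 elsewhere
    (an illustrative convention, not necessarily the paper's).
    """
    return gamma ** steps_to_goal


def action_gap(steps_to_goal: int, detour: int = 2, gamma: float = 0.99) -> float:
    """Value gap between a shortest-path action and one costing `detour` extra steps."""
    best = discounted_goal_value(steps_to_goal, gamma)
    worse = discounted_goal_value(steps_to_goal + detour, gamma)
    return best - worse


if __name__ == "__main__":
    # The gap that separates a good action from a poor one shrinks with distance.
    for horizon in (10, 100, 1000):
        print(f"T={horizon:5d}  gap={action_gap(horizon):.3e}")
```

For distant goals the gap can fall below the approximation error of a learned value function, which is one way to read the claim that discounting obscures the comparative advantages of primitive actions. For a nearby subgoal the same gap is much larger, which is the regime that subgoal-conditioned policies train in.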
