Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows

Shaswat Garg, Matin Moezzi, Brandon Da Silva

TL;DR

NF-HIQL is a flow-based hierarchical RL framework that replaces the Gaussian policies at both the high and low levels of HIQL with expressive normalizing-flow policies, enabling multimodal, data-efficient learning in long-horizon tasks. The method provides tractable, exact log-densities and closed-form objectives, along with KL-divergence and PAC-style guarantees that support stability and sample efficiency in offline regimes. Empirically, NF-HIQL achieves state-of-the-art performance on diverse OGBench tasks and remains data-efficient, including when trained on only half of the available data. A real-world validation on a 6-DOF robotic arm demonstrates reliable transfer and sustained performance under limited offline data.

Abstract

Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. This work introduces normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume preserving (RealNVP) policies and PAC-style sample efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
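
To make "tractable log-likelihood computation" and "efficient sampling" concrete, the following is a minimal sketch of a conditional RealNVP-style affine coupling flow over a two-dimensional output. It is not the authors' implementation: the scalar conditioning input `cond`, the toy linear conditioner, and the weights in `LAYERS` are all assumptions made for illustration, and a real policy would use neural networks conditioned on state and goal. The point is only that sampling is one forward pass and the exact log-density follows from the change-of-variables formula.

```python
import math
import random

# Hypothetical toy weights for two coupling layers (illustration only).
LAYERS = [
    {"s_z": 0.8, "s_c": -0.5, "t_z": 1.5, "t_c": 0.3},
    {"s_z": -0.6, "s_c": 0.4, "t_z": -1.0, "t_c": 0.7},
]


def _scale_shift(kept, cond, w):
    """Toy conditioner: log-scale and shift computed from the untouched
    coordinate and the conditioning input (stand-in for a neural network)."""
    log_scale = math.tanh(w["s_z"] * kept + w["s_c"] * cond)  # bounded in (-1, 1)
    shift = w["t_z"] * kept + w["t_c"] * cond
    return log_scale, shift


def forward(z, cond, layers):
    """Map base noise z to an output x; also return log|det J| of the map."""
    x, logdet = z, 0.0
    for w in layers:
        s, t = _scale_shift(x[0], cond, w)
        x = (x[0], x[1] * math.exp(s) + t)  # affine-transform second coordinate
        logdet += s                         # Jacobian is triangular, det = exp(s)
        x = (x[1], x[0])                    # swap so the other coordinate moves next
    return x, logdet


def inverse(x, cond, layers):
    """Exactly invert forward(); returns the base noise and the same log|det J|."""
    z, logdet = x, 0.0
    for w in reversed(layers):
        z = (z[1], z[0])                    # undo the swap
        s, t = _scale_shift(z[0], cond, w)
        z = (z[0], (z[1] - t) * math.exp(-s))
        logdet += s
    return z, logdet


def std_normal_logpdf(z):
    """Log-density of an isotropic standard normal base distribution."""
    return sum(-0.5 * v * v - 0.5 * math.log(2.0 * math.pi) for v in z)


def log_prob(x, cond, layers):
    """Exact log-density: log p(x) = log N(z) - log|det J|."""
    z, logdet = inverse(x, cond, layers)
    return std_normal_logpdf(z) - logdet


def sample(cond, layers, rng):
    """Draw one sample with a single forward pass through the flow."""
    z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    x, _ = forward(z, cond, layers)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    a = sample(0.5, LAYERS, rng)
    print("sample:", a, "log-density:", log_prob(a, 0.5, LAYERS))
```

In a hierarchical setting, one such flow could play the role of the high-level policy (conditioned on state and goal, producing a subgoal) and a second the low-level policy (conditioned on state and subgoal, producing an action); that mapping is an assumption of this sketch rather than a description of the paper's architecture.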

Paper Structure

This paper contains 18 sections, 5 theorems, 38 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Lemma 2

Let $\pi^b(\cdot \mid s)$ denote the behavior policy and $\pi_\theta(\cdot \mid s)$ the learned RealNVP policy given state $s$. If the action space is bounded and the behavior density is capped by a constant $M<\infty$, then there exists a constant $B<\infty$ (determined by the RealNVP architecture)
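
For intuition about how a density cap $M$ and an architecture-dependent constant $B$ can combine, here is a generic argument. It is a sketch under stated assumptions, not a reproduction of the lemma's conclusion. It assumes (i) the divergence is taken from the behavior policy to the learned policy, and (ii) the RealNVP log-density satisfies $\log \pi_\theta(a \mid s) \ge -B$ on the bounded action space. Under those assumptions, $\log \pi^b(a \mid s) \le \log M$ and $-\log \pi_\theta(a \mid s) \le B$ hold pointwise, so the expectation is bounded:

```latex
\begin{aligned}
\mathrm{KL}\big(\pi^b(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)
  &= \mathbb{E}_{a \sim \pi^b(\cdot \mid s)}
     \big[\log \pi^b(a \mid s) - \log \pi_\theta(a \mid s)\big] \\
  &\le \log M + B .
\end{aligned}
```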

Figures (3)

  • Figure 1: Evaluation environments from OGBench (Park et al., 2024): (a) AntMaze-medium-navigate (long-horizon maze navigation); (b) AntSoccer-medium-navigate (wall-bounded dribbling and navigation); (c) AntSoccer-arena-navigate (open-field dribbling and navigation); (d) Cube-single-play (pick-and-place from play data); (e) Scene-play (multi-object, multi-step sequencing from play data).
  • Figure 2: Success rate (%) across training steps on OGBench environments. NF-HIQL consistently outperforms baselines, showing faster convergence and higher final success rates, particularly in complex manipulation tasks (cube-single-play, scene-play) and the AntSoccer dribbling settings.
  • Figure 3: Experimental setup with the 6-DOF myCobot 280 arm.

Theorems & Definitions (10)

  • Lemma 2: KL Divergence Bound
  • proof
  • Lemma 3: PAC-Style Sample Efficiency
  • proof
  • Lemma 1: RealNVP Lower Bound
  • proof
  • Lemma 2: KL Bound with Behavior Density Cap
  • proof
  • Lemma 3: PAC-Style Sample Efficiency with Explicit Constants
  • proof