HIQL: Offline Goal-Conditioned RL with Latent States as Actions

Seohong Park, Dibya Ghosh, Benjamin Eysenbach, Sergey Levine

TL;DR

The paper addresses a central difficulty in offline goal-conditioned RL: learning accurate value functions for long-horizon goals. It introduces Hierarchical Implicit Q-Learning (HIQL), which extracts a high-level subgoal policy and a low-level action policy from a single goal-conditioned value function trained with action-free IQL, with subgoals represented by $\phi(g)$, a representation learned end-to-end. HIQL shows strong improvements on six offline goal-conditioned benchmarks, including high-dimensional pixel-based tasks, can leverage action-free data, and remains robust to noise in the value function. The paper also analyzes how hierarchical structure improves the signal-to-noise ratio of value estimates, discusses limitations, and leaves the handling of stochastic dynamics to future work.

Abstract

Unsupervised pre-training has recently become the bedrock for computer vision and natural language processing. In reinforcement learning (RL), goal-conditioned RL can potentially provide an analogous self-supervised approach for making use of large quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL that can learn directly from diverse offline data is challenging, because it is hard to accurately estimate the exact value function for faraway goals. Nonetheless, goal-reaching problems exhibit structure, such that reaching distant goals entails first passing through closer subgoals. This structure can be very useful, as assessing the quality of actions for nearby goals is typically easier than for more distant goals. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that treats states as actions and predicts (a latent representation of) a subgoal and a low-level policy that predicts the action for reaching this subgoal. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data. Our code is available at https://seohong.me/projects/hiql/
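The action-free value function mentioned above can be fit with an expectile regression in the style of IQL. The following is a minimal sketch, not the authors' implementation: the function names, the default `tau` and `gamma`, and the sparse reward convention (0 at the goal, -1 otherwise) are assumptions made for illustration.

```python
def expectile_loss(u, tau=0.7):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 used in IQL-style training."""
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def sparse_reward(s, g):
    """Assumed reward convention: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if s == g else -1.0


def action_free_value_loss(v_s, v_next_target, reward, gamma=0.99, tau=0.7):
    """Loss for one transition (s, s') and goal g, with no action involved.

    v_s           : current estimate V(s, g)
    v_next_target : target-network estimate V_bar(s', g)
    """
    td_error = reward + gamma * v_next_target - v_s
    return expectile_loss(td_error, tau)
```

With `tau` above 0.5, positive errors are penalized more than negative ones, so the fitted value leans toward the better outcomes seen in the data without ever querying actions.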

Paper Structure

This paper contains 40 sections, 2 theorems, 12 equations, 20 figures, 4 tables, 1 algorithm.

Key Result

Proposition 4.1

In the environment described in Figure 3, the probability of the flat policy $\pi$ selecting an incorrect action is given as ${\mathcal{E}}(\pi) = \Phi\left(-\frac{\sqrt 2}{\sigma \sqrt{T^2 + 1}}\right)$ and the probability of the hierarchical policy $\pi^\ell \circ \pi^h$ selecting an incorrect
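The flat-policy error above can be evaluated numerically and checked against a small simulation. This is a sketch under stated assumptions: $\Phi$ is taken to be the standard normal CDF, $T$ the distance to the goal, and $\sigma$ the noise scale. The noise model in `simulate_flat_policy_error` (optimal value $-|s - g|$ perturbed by multiplicative Gaussian noise) is an assumption chosen because it reproduces the stated formula; it is not quoted from the text above.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def flat_policy_error(T, sigma):
    """Closed-form error of the flat policy from Proposition 4.1."""
    return normal_cdf(-math.sqrt(2.0) / (sigma * math.sqrt(T * T + 1.0)))


def simulate_flat_policy_error(T, sigma, trials=100_000, seed=0):
    """Monte Carlo estimate under the assumed multiplicative noise model.

    The agent sits at s = 0 with the goal at g = T and compares the noisy
    values of the two neighbours s = +1 and s = -1.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        v_right = -(T - 1) * (1.0 + sigma * rng.gauss(0.0, 1.0))
        v_left = -(T + 1) * (1.0 + sigma * rng.gauss(0.0, 1.0))
        if v_left > v_right:  # the step away from the goal looks better
            errors += 1
    return errors / trials


if __name__ == "__main__":
    for T in (5, 10, 50):
        print(T, flat_policy_error(T, 0.5), simulate_flat_policy_error(T, 0.5))
```

As $T$ grows, the argument of $\Phi$ tends to zero from below, so the flat policy's error approaches 1/2, which is chance level for a two-action choice.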

Figures (20)

  • Figure 1: (left) We train a value function parameterized as $V(s, \phi(g))$, where $\phi(g)$ corresponds to the subgoal representation. The high-level policy predicts the representation of a subgoal $z_{t+k} = \phi(s_{t+k})$. The low-level policy takes this representation as input to produce actions to reach the subgoal. (right) In contrast to many prior works on hierarchical RL, we extract both policies from the same value function. Nonetheless, this hierarchical structure yields a better "signal-to-noise" ratio than a flat, non-hierarchical policy, due to the improved relative differences between values.
  • Figure 2: Hierarchies allow us to better make use of noisy value estimates. (a) In this gridworld environment, the optimal value function predicts higher values for states $s$ that are closer to the goal $g$ (•). (b, c) However, a noisy value function results in selecting incorrect actions ($\rightarrow$). (d) Our method uses this same noisy value function to first predict an intermediate subgoal, and then select an action for reaching this subgoal. Actions selected in this way correctly lead to the goal.
  • Figure 3: 1-D toy environment.
  • Figure 4: Comparison of policy errors in flat vs. hierarchical policies in didactic environments. The hierarchical policy, with an appropriate subgoal step, often yields significantly lower errors than the flat policy.
  • Figure 5: State-based benchmark environments.
  • ...and 15 more figures
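
The "relative differences between values" mentioned in the Figure 1 caption can be illustrated with a small calculation. The sketch below assumes a 1-D chain with the distance-based value $V(s, g) = -|s - g|$; `toy_value` and `relative_gap` are hypothetical helpers introduced only for this illustration.

```python
def toy_value(s, g):
    """Assumed toy value: negative distance from state s to goal g."""
    return -abs(s - g)


def relative_gap(s, g):
    """Value gap between the two neighbouring states, relative to |V(s, g)|.

    Assumes s != g. A larger ratio means the better neighbour is easier to
    pick out when the value estimate carries noise proportional to its size.
    """
    right = toy_value(s + 1, g)
    left = toy_value(s - 1, g)
    return abs(right - left) / abs(toy_value(s, g))


if __name__ == "__main__":
    print(relative_gap(0, 50))  # distant goal
    print(relative_gap(0, 5))   # nearby subgoal
```

In this toy setting the absolute gap between the two neighbours is the same for both targets, but it is a much larger fraction of the value when the target is a nearby subgoal, which is the intuition behind conditioning the low-level policy on subgoals.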

Theorems & Definitions (4)

  • Proposition 4.1
  • Proposition 5.1: Goal representations from the value function are sufficient for action selection
  • proof
  • proof