Foundation Policies with Hilbert Representations

Seohong Park, Tobias Kreiman, Sergey Levine

TL;DR

Foundation policies are learned offline by mapping states into a Hilbert latent space with a distance-preserving representation $\phi$ such that $d^*(s,g) \approx \|\phi(s) - \phi(g)\|$, where $d^*$ denotes the temporal distance. A latent-conditioned policy $\pi(a \mid s,z)$, called a Hilbert foundation policy (HILP), then spans the latent space via directional rewards $r(s,z,s') = \langle \phi(s') - \phi(s), z \rangle$, yielding diverse long-horizon skills trained via offline RL. The key contributions are (i) an offline pre-training objective that yields a structured Hilbert representation and a versatile HILP, (ii) zero-shot policy prompting via linear regression over $\tilde{\phi}(s,a,s')$, and (iii) test-time planning that uses a minimax midpoint subgoal to refine latent prompts. The method is evaluated on seven robotic locomotion and manipulation environments, where it shows strong zero-shot and planning-enabled performance. It adapts to downstream tasks without additional online data, which makes it practical for building generalist policies from unlabeled data. Limitations include the assumption of a relatively well-behaved Hilbert embedding and potential difficulties in highly asymmetric or partially observable MDPs, which motivate future work on more general embeddings and fine-tuning strategies.
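The zero-shot prompting step in (ii) can be illustrated with a minimal sketch. It assumes that $\tilde{\phi}(s,a,s')$ takes the form $\phi(s') - \phi(s)$, matching the directional reward above, and that the latent prompt $z$ is the ordinary least-squares fit of task rewards onto these features. The function name `infer_latent_prompt` and the synthetic data are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def infer_latent_prompt(phi_s, phi_next, rewards):
    """Fit z so that <phi(s') - phi(s), z> approximates the task reward.

    phi_s, phi_next: arrays of shape (N, D) holding phi(s) and phi(s').
    rewards: array of shape (N,) holding task rewards for the same transitions.
    """
    # Assumed feature form: phi_tilde(s, a, s') = phi(s') - phi(s).
    features = phi_next - phi_s
    # Ordinary least squares: argmin_z || features @ z - rewards ||^2.
    z, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    return z


# Synthetic check: rewards generated from a known direction are recovered.
rng = np.random.default_rng(0)
latent_dim = 4
true_z = np.array([1.0, -2.0, 0.5, 0.0])
phi_s = rng.normal(size=(256, latent_dim))
phi_next = phi_s + rng.normal(size=(256, latent_dim))
rewards = (phi_next - phi_s) @ true_z

z_hat = infer_latent_prompt(phi_s, phi_next, rewards)
```

In this sketch the recovered `z_hat` would be passed to the pre-trained policy $\pi(a \mid s, z)$ as its prompt. Any rescaling of $z$ before it is used is left out, since the summary does not specify it.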

Abstract

Unsupervised and self-supervised objectives, such as next token prediction, have enabled pre-training generalist models from large amounts of unlabeled data. In reinforcement learning (RL), however, finding a truly general and scalable unsupervised pre-training objective for generalist policies from offline data remains a major open question. While a number of methods have been proposed to enable generic self-supervised RL, based on principles such as goal-conditioned RL, behavioral cloning, and unsupervised skill learning, such methods remain limited in terms of either the diversity of the discovered behaviors, the need for high-quality demonstration data, or the lack of a clear adaptation mechanism for downstream tasks. In this work, we propose a novel unsupervised framework to pre-train generalist policies that capture diverse, optimal, long-horizon behaviors from unlabeled offline data such that they can be quickly adapted to any arbitrary new tasks in a zero-shot manner. Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment, and then to span this learned latent space with directional movements, which enables various zero-shot policy "prompting" schemes for downstream tasks. Through our experiments on simulated robotic locomotion and manipulation benchmarks, we show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion, even often outperforming prior methods designed specifically for each setting. Our code and videos are available at https://seohong.me/projects/hilp/.

Paper Structure

This paper contains 21 sections, 4 theorems, 21 equations, 17 figures, 7 tables, 2 algorithms.

Key Result

Theorem 5.1

If embedding errors are bounded as $\sup_{s, g \in {\mathcal{S}}} |d^*(s, g) - \|\phi(s) - \phi(g)\|| \leq {\varepsilon}_e$, directional movement errors are bounded as $\sup_{s, g \in {\mathcal{S}}} \|z'^*(s, g) - \hat{z}'(s, g)\| \leq {\varepsilon}_d$, and $4{\varepsilon}_e + {\varepsilon}_d < 1$, ...

Figures (17)

  • Figure 1: Illustration of HILPs. (left) We first train a distance-preserving mapping $\phi:{\mathcal{S}} \to {\mathcal{Z}}$ that maps temporally similar states to spatially similar latent states ($d^*$ denotes the temporal distance). (right) We then train a latent-conditioned policy $\pi(a \mid s, z)$, which we call a Hilbert foundation policy, that spans that latent space with directional movements. This policy captures diverse long-horizon behaviors from unlabeled data, which can be directly used to solve a variety of downstream tasks efficiently, even in a zero-shot manner.
  • Figure 2: Diagram of HILPs. (a) We train a Hilbert representation $\phi(s)$ using a goal-conditioned value learning objective with the value function parameterized as $V(s, g) = -\|\phi(s) - \phi(g)\|$. (b) We train a Hilbert foundation policy $\pi(a \mid s, z)$ using the intrinsic reward function $r(s, z, s')$ defined as the inner product between $\phi(s') - \phi(s)$ and a randomly sampled unit vector $z$.
  • Figure 3: Test-time midpoint planning. In the presence of embedding errors, the direction toward the midpoint subgoal $w^*$ can be more accurate than the direction toward the goal $g$.
  • Figure 4: Environments. We evaluate HILPs on seven robotic locomotion and manipulation environments.
  • Figure 5: Zero-shot RL performance. HILP achieves the best zero-shot RL performance in the ExORL benchmark, outperforming previous state-of-the-art approaches. The overall results are aggregated over $4$ environments, $4$ tasks, $4$ datasets, and $4$ seeds (i.e., $256$ values in total).
  • ...and 12 more figures
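
The quantities named in the captions of Figures 2 and 3 can be written down in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: it assumes embeddings $\phi(\cdot)$ are already available as arrays, and it reads "minimax midpoint" as the candidate $w$ that minimizes $\max(\|\phi(s) - \phi(w)\|, \|\phi(w) - \phi(g)\|)$ over a hypothetical candidate set. All function names are illustrative.

```python
import numpy as np


def latent_distance(phi_a, phi_b):
    """Euclidean distance in the latent space, used as a proxy for d*."""
    return np.linalg.norm(phi_a - phi_b, axis=-1)


def value(phi_s, phi_g):
    """Goal-conditioned value parameterized as V(s, g) = -||phi(s) - phi(g)||."""
    return -latent_distance(phi_s, phi_g)


def sample_unit_vector(rng, dim):
    """Draw a random direction z on the unit sphere."""
    z = rng.normal(size=dim)
    return z / np.linalg.norm(z)


def directional_reward(phi_s, phi_next, z):
    """Intrinsic reward r(s, z, s') = <phi(s') - phi(s), z>."""
    return (phi_next - phi_s) @ z


def minimax_midpoint(phi_s, phi_g, phi_candidates):
    """Index of the candidate w minimizing max(d(s, w), d(w, g)).

    phi_candidates has shape (K, D). Where the candidates come from is an
    assumption of this sketch.
    """
    worst_leg = np.maximum(
        latent_distance(phi_candidates, phi_s),
        latent_distance(phi_candidates, phi_g),
    )
    return int(np.argmin(worst_leg))


# Small worked example in a 2-D latent space.
phi_s = np.array([0.0, 0.0])
phi_g = np.array([4.0, 0.0])
candidates = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 3.0]])

v = value(phi_s, phi_g)
z = sample_unit_vector(np.random.default_rng(0), dim=2)
r = directional_reward(phi_s, candidates[0], np.array([1.0, 0.0]))
w_idx = minimax_midpoint(phi_s, phi_g, candidates)
```

In the worked example the second candidate lies halfway between the state and the goal, so it is the one selected. The direction from $\phi(s)$ toward that subgoal could then be used as the latent prompt in place of the direction toward $g$.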

Theorems & Definitions (6)

  • Theorem 5.1: Directional movements in the latent space are optimal for goal reaching
  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Corollary 3.3