Offline Goal-conditioned Reinforcement Learning with Quasimetric Representations

Vivek Myers, Bill Chunyuan Zheng, Benjamin Eysenbach, Sergey Levine

TL;DR

This work addresses offline goal-conditioned reinforcement learning (GCRL) by learning temporal distances that enable optimal goal-reaching even with suboptimal data and stochastic dynamics. It introduces Temporal Metric Distillation (TMD), a framework that unifies contrastive successor representations with the quasimetric triangle inequality, supplemented by action and temporal invariances, to recover the optimal successor distance through a fixed-point formulation. The authors provide both a theoretical guarantee of convergence to the optimal distance and practical algorithmic components, including contrastive initialization and invariance losses. On offline benchmarks, TMD improves on prior methods, and ablations indicate that each loss component contributes to performance. The approach offers a principled path to stitching and long-horizon planning in offline, high-dimensional settings.

Abstract

Approaches for goal-conditioned reinforcement learning (GCRL) often use learned state representations to extract goal-reaching policies. Two frameworks for representation structure have yielded particularly effective GCRL algorithms: (1) *contrastive representations*, in which methods learn "successor features" with a contrastive objective that performs inference over future outcomes, and (2) *temporal distances*, which link the (quasimetric) distance in representation space to the transit time from states to goals. We propose an approach that unifies these two frameworks, using the structure of a quasimetric representation space (triangle inequality) with the right additional constraints to learn successor representations that enable optimal goal-reaching. Unlike past work, our approach is able to exploit a **quasimetric** distance parameterization to learn **optimal** goal-reaching distances, even with **suboptimal** data and in **stochastic** environments. This gives us the best of both worlds: we retain the stability and long-horizon capabilities of Monte Carlo contrastive RL methods, while getting the free stitching capabilities of quasimetric network parameterizations. On existing offline GCRL benchmarks, our representation learning objective improves performance on stitching tasks where methods based on contrastive learning struggle, and on noisy, high-dimensional environments where methods based on quasimetric networks struggle.
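To make the notion of a quasimetric temporal distance concrete, here is a minimal sketch on a toy problem. It is not the paper's method: it assumes a small deterministic directed graph (a hypothetical four-state cycle), uses the minimum number of steps between states as the temporal distance, and checks that this distance satisfies the triangle inequality while being asymmetric. The helper names `shortest_path_steps` and `is_quasimetric` are illustrative only.

```python
import itertools

INF = float("inf")


def shortest_path_steps(n, edges):
    """All-pairs minimum step counts on a directed graph (Floyd-Warshall)."""
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        d[u][v] = min(d[u][v], 1)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def is_quasimetric(d):
    """Check d(x, x) = 0 and d(x, z) <= d(x, y) + d(y, z); symmetry is NOT required."""
    n = len(d)
    zero_diagonal = all(d[i][i] == 0 for i in range(n))
    triangle = all(
        d[i][j] <= d[i][k] + d[k][j]
        for i, j, k in itertools.product(range(n), repeat=3)
    )
    return zero_diagonal and triangle


# Hypothetical toy environment: a one-way cycle 0 -> 1 -> 2 -> 3 -> 0.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
dist = shortest_path_steps(4, edges)

print("d(0, 1) =", dist[0][1])  # one step forward
print("d(1, 0) =", dist[1][0])  # must go around the cycle
print("quasimetric:", is_quasimetric(dist))
```

In this toy case reaching state 1 from state 0 takes one step, while the reverse trip takes three, so the distance is asymmetric yet still obeys the triangle inequality. The setting in the abstract is harder: dynamics may be stochastic and the distance is learned from suboptimal offline data rather than computed by exhaustive search.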

Paper Structure

This paper contains 31 sections, 6 theorems, 49 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

theorem 1

Take $d \in$ […] and consider the sequence […]. Then $d_n$ converges uniformly to a fixed point $d_{\infty} \in \mathcal{Q}$.

Figures (5)

  • Figure 1: TMD learns a temporal distance $d_{\theta}$ that satisfies the triangle inequality and action invariance. It does this by minimizing the distance between the learned distance and the distance between the successor features of the states and actions in the dataset. The learned distance is used to extract a goal-conditioned policy.
  • Figure 2: TMD enables key capabilities over prior work: handling stochastic transition dynamics, learning optimal policies from offline data, and stitching behaviors as a property of network architecture.
  • Figure 3: An example distance heatmap learned by TMD in pointmaze_large_stitch. Darker colors indicate larger distances.
  • Figure 4: We ablate the loss components of TMD in the pointmaze_teleport_stitch environment.
  • Figure 5: Comparison of Bregman divergences for fitting $e^{-d}$ to $e^{-d'}$ in expectation. All losses are minimized at $d = d'$, and share the property that they are minimized in expectation when $e^{-d} = \mathbb{E} [ e ^{ -d'}]$. However, only the $D_T(d,d')$ loss has non-vanishing gradients when $d \gg d'$ for large $d'$.
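
The shared property in the Figure 5 caption, that the expected loss is minimized when $e^{-d} = \mathbb{E}[e^{-d'}]$, can be checked numerically. The sketch below is illustrative only and does not reproduce the paper's $D_T$ loss, whose definition is not given in this excerpt. It assumes three standard Bregman generators (squared error, generalized KL, Itakura-Saito), takes the divergence with the target $e^{-d'}$ in the first argument, and uses two made-up target distances; all function and variable names are hypothetical.

```python
import math


def bregman(F, dF, target, pred):
    """Bregman divergence D_F(target, pred) = F(target) - F(pred) - F'(pred) * (target - pred)."""
    return F(target) - F(pred) - dF(pred) * (target - pred)


# Standard strictly convex generators on the positive reals (assumed for illustration).
GENERATORS = {
    "squared": (lambda x: x * x, lambda x: 2.0 * x),
    "generalized_kl": (lambda x: x * math.log(x) - x, lambda x: math.log(x)),
    "itakura_saito": (lambda x: -math.log(x), lambda x: -1.0 / x),
}


def expected_loss(name, d, d_targets):
    """Average divergence between e^{-d'} (targets) and e^{-d} (prediction)."""
    F, dF = GENERATORS[name]
    pred = math.exp(-d)
    total = sum(bregman(F, dF, math.exp(-t), pred) for t in d_targets)
    return total / len(d_targets)


def argmin_on_grid(name, d_targets, lo=0.0, hi=5.0, steps=5001):
    """Brute-force minimizer over an evenly spaced grid of candidate distances d."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda d: expected_loss(name, d, d_targets))


# Made-up target distances d', equally likely.
d_targets = [1.0, 3.0]

# Closed form predicted by the caption: e^{-d*} = E[e^{-d'}].
d_star = -math.log(sum(math.exp(-t) for t in d_targets) / len(d_targets))

minimizers = {name: argmin_on_grid(name, d_targets) for name in GENERATORS}
for name, d_hat in minimizers.items():
    print(f"{name}: grid argmin = {d_hat:.3f}, closed form = {d_star:.3f}")
```

All three generators agree on the same minimizer up to the grid resolution, and that minimizer lies below the arithmetic mean of the target distances because the average is taken over $e^{-d'}$ rather than over $d'$. The sketch does not examine gradient magnitudes, which is where the caption says the losses differ.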

Theorems & Definitions (14)

  • theorem 1
  • remark 1
  • theorem 2
  • lemma 1
  • proof
  • proof: Proof of the convergence theorem (`thm:convergence`)
  • proof: Proof of the fixed-point theorem (`thm:fixed_point`)
  • lemma 2
  • proof
  • lemma 3
  • ...and 4 more