Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making

Vivek Myers, Chongyi Zheng, Anca Dragan, Sergey Levine, Benjamin Eysenbach

TL;DR

This work defines a temporal distance for goal-reaching in stochastic environments that satisfies the triangle inequality by applying a simple change of variables to contrastive successor features. The distance, d_{SD}, is grounded in discounted future occupancy measures and extended to state-action spaces, with theoretical guarantees that it forms a quasimetric. The authors implement two distillation schemes (CMD-1 and CMD-2) to learn a usable quasimetric via contrastive learning, culminating in an MRN-based parameterization that supports efficient RL with policy extraction. Empirical results on synthetic navigation and a 111-dimensional AntMaze task show strong combinatorial generalization (stitching) and competitive performance, illustrating the practical impact of metric-distilled temporal distances for goal-conditioned control.

Abstract

Temporal distances lie at the heart of many algorithms for planning, control, and reinforcement learning that involve reaching goals, allowing one to estimate the transit time between two states. However, prior attempts to define such temporal distances in stochastic settings have been stymied by an important limitation: these prior approaches do not satisfy the triangle inequality. This is not merely a definitional concern, but translates to an inability to generalize and find shortest paths. In this paper, we build on prior work in contrastive learning and quasimetrics to show how successor features learned by contrastive learning (after a change of variables) form a temporal distance that does satisfy the triangle inequality, even in stochastic settings. Importantly, this temporal distance is computationally efficient to estimate, even in high-dimensional and stochastic settings. Experiments in controlled settings and benchmark suites demonstrate that an RL algorithm based on these new temporal distances exhibits combinatorial generalization (i.e., "stitching") and can sometimes learn more quickly than prior methods, including those based on quasimetrics.
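The change of variables can be illustrated on a toy MDP. Assuming the successor-distance form d_SD(s, g) = log p(g | g) − log p(g | s) over discounted future occupancy measures (our reading of the abstract's construction; the paper estimates these quantities contrastively and over state-action pairs), a minimal sketch with exact occupancies:

```python
import numpy as np

# Toy deterministic 4-state chain MDP: 0 -> 1 -> 2 -> 3, with 3 absorbing.
# Sketch of the change of variables described in the abstract, assuming the
# successor-distance form d_SD(s, g) = log p(g | g) - log p(g | s), where
# p(. | s) is the discounted future state occupancy starting from s. (The
# paper learns these occupancies contrastively; here we compute them exactly
# for illustration.)
gamma = 0.9
n = 4
P = np.zeros((n, n))
for s in range(n - 1):
    P[s, s + 1] = 1.0      # deterministic step to the right
P[n - 1, n - 1] = 1.0      # absorbing final state

# Discounted occupancy: p = (1 - gamma) * (I - gamma * P)^-1
occ = (1 - gamma) * np.linalg.inv(np.eye(n) - gamma * P)

with np.errstate(divide="ignore"):  # unreachable goals get distance +inf
    log_occ = np.log(occ)

def d_sd(s, g):
    """Successor distance from s to g (change of variables on occupancies)."""
    return log_occ[g, g] - log_occ[s, g]

D = np.array([[d_sd(s, g) for g in range(n)] for s in range(n)])

# d_SD behaves as a quasimetric: zero on the diagonal, asymmetric in
# general, and the triangle inequality holds for every triple of states.
assert np.allclose(np.diag(D), 0.0)
for x in range(n):
    for y in range(n):
        for z in range(n):
            assert D[x, z] <= D[x, y] + D[y, z] + 1e-9
```

On this chain, d_sd(0, 2) comes out to 2·log(1/γ), i.e., transit time scaled by log(1/γ), while d_sd(1, 0) is +inf because the chain cannot move left; the asymmetry is why the object is a quasimetric rather than a metric.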

Paper Structure

This paper contains 36 sections, 17 theorems, 56 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Lemma 3.0

$d_{\textsc{sd}}\bigl( (s,a), (s',a') \bigr)$ is independent of $a'$ when $s \neq s'$.

Figures (4)

  • Figure 1: An overview of our theoretical distance construction as well as the concrete implementation with metric distillation.
  • Figure 2: (Left) We collect four types of trajectories on this 2D navigation task. The large gray arrows depict the direction of motion. Note that navigating between certain states requires piecing together trajectories of different colors. (Right) Our proposed temporal distance correctly pieces together trajectories, allowing an RL agent to successfully navigate between pairs of states that never occur on the same trajectory. This combinatorial generalization (Ghugare et al., 2024) or "stitching" (Fu et al., 2020) property is typically associated with bootstrapping with temporal difference learning, which our temporal distances do not require.
  • Figure 3: Metric distillation enables more efficient offline training and long-horizon compositional generalization. Results are plotted with one standard error.
  • Figure 4: A simple illustration of a metric over $\mathcal{S}\times\mathcal{A}$. To stitch the behavior $s\to w$ from $\pi_1$ and $w\to g$ from $\pi_2$ to the behavior $s\to g$ that is possible under some policy $\pi'$, we enforce an additional constraint that distances to $(w,\to)$ are the same as distances to $(w,\circlearrowright)$.

Theorems & Definitions (31)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Lemma 3.0
  • Lemma 3.0
  • Lemma 3.0
  • Theorem 3.1
  • Corollary 3.1.1
  • Corollary 3.1.2
  • Lemma 4.0
  • ...and 21 more