Offline Goal-Conditioned Reinforcement Learning with Projective Quasimetric Planning

Anthony Kobanda, Waris Radji, Mathieu Petitbois, Odalric-Ambrym Maillard, Rémy Portelas

TL;DR

This work addresses offline goal-conditioned reinforcement learning for long-horizon tasks by introducing Projective Quasimetric Planning (ProQ), a geometry-driven framework that learns a directional latent space and a sparse set of uniformly distributed keypoints. The latent space is shaped by an encoder, an asymmetric quasimetric, and an out-of-distribution (OOD) detector; the keypoints are driven by a Coulomb-like repulsion and an OOD barrier so that they cover the reachable data manifold without leaving it. Planning proceeds via a directed graph of keypoints and Floyd–Warshall lookups, while short-horizon control is provided by an Advantage Weighted Regression (AWR) policy that moves between keypoints. On the PointMaze benchmarks from OGBench, ProQ achieves state-of-the-art success rates, demonstrating robust long-horizon navigation with efficient planning and informative latent mappings; ablations confirm the necessity of the OOD barrier for maintaining feasible plans and coverage.
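The planning step can be illustrated with a minimal sketch, assuming the keypoints have already been learned and that pairwise directed costs are available as a matrix. The names `floyd_warshall`, `extract_path`, and `toy_cost` are hypothetical, and the toy costs are made up for illustration; how ProQ actually builds its edges is not specified in this summary.

```python
import math

def floyd_warshall(cost):
    """All-pairs shortest paths on a directed graph.

    cost[i][j] is the (possibly asymmetric) cost of going from keypoint i
    to keypoint j, with math.inf where no direct edge exists.
    Returns (dist, nxt), where nxt[i][j] is the first hop on a best path i -> j.
    """
    n = len(cost)
    dist = [row[:] for row in cost]
    nxt = [[j if cost[i][j] < math.inf else None for j in range(n)]
           for i in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        nxt[i][i] = i
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def extract_path(nxt, start, goal):
    """Follow the first-hop table to list the keypoints visited from start to goal."""
    if nxt[start][goal] is None:
        return []  # goal is unreachable from start
    path = [start]
    while path[-1] != goal:
        path.append(nxt[path[-1]][goal])
    return path

# Toy directed graph (assumed values): a one-way corridor 0 -> 1 -> 2 -> 3,
# plus a single expensive edge from 3 back to 0.
INF = math.inf
toy_cost = [
    [0.0, 1.0, INF, INF],
    [INF, 0.0, 1.0, INF],
    [INF, INF, 0.0, 1.0],
    [5.0, INF, INF, 0.0],
]
dist, nxt = floyd_warshall(toy_cost)
plan_forward = extract_path(nxt, 0, 3)   # walks the corridor
plan_backward = extract_path(nxt, 3, 0)  # uses the direct return edge
```

Because the costs are directional, the forward and backward plans differ in both route and total cost. Once the tables are precomputed, choosing the next sub-goal at run time is a constant-time lookup in `nxt`.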

Abstract

Offline Goal-Conditioned Reinforcement Learning seeks to train agents to reach specified goals from previously collected trajectories. Scaling that promise to long-horizon tasks remains challenging, notably due to compounding value-estimation errors. Principled geometry offers a potential solution to these issues. Following this insight, we introduce Projective Quasimetric Planning (ProQ), a compositional framework that learns an asymmetric distance and then repurposes it, first as a repulsive energy forcing a sparse set of keypoints to spread uniformly over the learned latent space, and second as a structured directional cost guiding the agent towards proximal sub-goals. In particular, ProQ couples this geometry with a Lagrangian out-of-distribution detector to ensure the learned keypoints stay within reachable areas. By unifying metric learning, keypoint coverage, and goal-conditioned control, our approach produces meaningful sub-goals and robustly drives long-horizon goal-reaching on diverse navigation benchmarks.
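To make the idea of using an asymmetric distance as a repulsive energy concrete, here is a toy sketch under strong assumptions: the latents are one-dimensional, `toy_quasimetric` is a hand-written stand-in for the learned distance, the energy is a simple inverse-distance sum, and a box clamp crudely replaces the out-of-distribution constraint. None of these choices are taken from the paper; they only illustrate the mechanism.

```python
def toy_quasimetric(x, y):
    # Hypothetical asymmetric distance on 1-D latents:
    # moving "backward" (y < x) costs twice as much as moving forward.
    return (y - x) if y >= x else 2.0 * (x - y)

def repulsion_energy(points, eps=1e-3):
    # Assumed Coulomb-like energy: sum of inverse distances over ordered pairs.
    total = 0.0
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j:
                total += 1.0 / (toy_quasimetric(p, q) + eps)
    return total

def repulsion_step(points, lr=1e-4, h=1e-5, low=0.0, high=1.0):
    # One gradient-descent step using central finite differences.
    # The clamp to [low, high] is a crude stand-in for keeping
    # keypoints inside the reachable region.
    new_points = []
    for i in range(len(points)):
        plus = points[:i] + [points[i] + h] + points[i + 1:]
        minus = points[:i] + [points[i] - h] + points[i + 1:]
        grad = (repulsion_energy(plus) - repulsion_energy(minus)) / (2.0 * h)
        moved = points[i] - lr * grad
        new_points.append(min(high, max(low, moved)))
    return new_points

initial_points = [0.45, 0.5, 0.55]  # keypoints start clustered together
final_points = initial_points
for _ in range(50):
    final_points = repulsion_step(final_points)
```

After a few dozen steps the outer points drift apart and the energy drops, while the clamp keeps every point inside the allowed interval. In the framework described above, the learned quasimetric and the out-of-distribution detector would play the roles of `toy_quasimetric` and the clamp.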

Paper Structure

This paper contains 56 sections, 8 theorems, 28 equations, 8 figures, 2 tables, 2 algorithms.

Key Result

Theorem 3.1

IQE Universal Approximation, General Case (see iqe for the demonstration): Consider any quasimetric space $(\mathcal{X},d)$ where $\mathcal{X}$ is compact and $d$ is continuous. $\forall~\epsilon>0$, with sufficiently large $N$, there exists a continuous encoder $f_{\theta'}:\mathcal{X}\rightarrow\
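For intuition about the Interval Quasimetric Embedding (IQE) that the theorem refers to, the sketch below follows the commonly stated sum-aggregated form: each latent is viewed as a small matrix, every row contributes the length of the union of intervals $[u_{ij}, \max(u_{ij}, v_{ij})]$, and the row contributions are added. The function names and the toy latents are hypothetical, and the exact parameterization used in the paper is not shown in this excerpt.

```python
def union_length(intervals):
    """Total length covered by a collection of closed intervals on the real line."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if end <= start:
            continue  # empty interval, contributes nothing
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def iqe_distance(u, v):
    """Sum over rows of the union length of [u_ij, max(u_ij, v_ij)].

    u and v are lists of rows (lists of floats) with identical shapes.
    """
    return sum(
        union_length([(a, max(a, b)) for a, b in zip(u_row, v_row)])
        for u_row, v_row in zip(u, v)
    )

# Toy latents (assumed values), each reshaped into 2 rows of 2 entries.
u = [[0.0, 1.0], [2.0, 2.0]]
v = [[3.0, 2.0], [1.0, 0.0]]
d_uv = iqe_distance(u, v)  # cost of going from u to v
d_vu = iqe_distance(v, u)  # cost of going back, generally different
d_uu = iqe_distance(u, u)  # a point is at distance zero from itself
```

Only coordinates where `v` exceeds `u` open a non-empty interval, which is what makes the distance directional: here `d_uv` and `d_vu` differ, while `d_uu` is zero.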

Figures (8)

  • Figure 1: Projective Quasimetric Planning (ProQ): Building a Precise State-Space Mapping. From left to right, we show how ProQ turns unlabeled traces into a geometry-aware navigation map: (a) we start with a dataset of transitions; (b) we jointly train an encoder $\phi_{\theta_\phi}$, a quasimetric $d_{\theta_d}$, and an out-of-distribution classifier $\psi_{\theta_\psi}$; (c) using $\phi$, $d$, and $\psi$, we initialize a small set of latent keypoints and let them evolve as identical particles under two energy-based forces: a Coulomb repulsion ensuring they spread uniformly across the latent space, and an OOD barrier keeping them within the in-distribution manifold; (d) to navigate the resulting space, we do path planning using Floyd–Warshall and action selection with an AWR-trained policy.
  • Figure 2: Layouts of the four PointMaze tasks used in our experiments. All images are rendered on the same grid resolution (blue squares have identical side length), so the overall arena grows from Medium to Giant. The Teleport task adds blue portal cells that stochastically move the agent to another non-local location.
  • Figure 3: Illustration of the learned latent mappings produced by ProQ on the four PointMaze tasks. OOD probabilities are shown from $0$ (yellow) to $1$ (magenta); the learned keypoints are shown in red.
  • Figure 4: Two plans produced by ProQ with Floyd–Warshall on the giant maze.
  • Figure 5: Keypoints learned without the OOD barrier on the giant maze; the yellow area is the reachable manifold.
  • ...and 3 more figures

Theorems & Definitions (12)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 4.1
  • Theorem 4.2
  • Lemma D.1
  • proof
  • Lemma D.2
  • proof
  • Lemma D.3
  • proof
  • ...and 2 more