Offline Goal-Conditioned Reinforcement Learning with Projective Quasimetric Planning
Anthony Kobanda, Waris Radji, Mathieu Petitbois, Odalric-Ambrym Maillard, Rémy Portelas
TL;DR
This work addresses offline goal-conditioned reinforcement learning for long-horizon tasks by introducing Projective Quasimetric Planning (ProQ), a geometry-driven framework that learns a directional latent space and a sparse set of uniformly distributed keypoints. The latent space is shaped by an encoder, an asymmetric quasimetric, and an OOD detector, with keypoints driven by Coulomb-like repulsion and an OOD barrier to ensure coverage within the reachable data manifold. Planning proceeds via a directed graph of keypoints and Floyd–Warshall lookups, while short-horizon control is provided by an Advantage Weighted Regression policy that moves between keypoints. On the PointMaze benchmarks from OGBench, ProQ achieves state-of-the-art success rates, demonstrating robust long-horizon navigation with efficient planning and informative latent mappings; ablations confirm the necessity of the OOD barrier for maintaining feasible plans and coverage.
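To make the planning step concrete, the following is a minimal sketch of all-pairs shortest paths over a directed keypoint graph using Floyd–Warshall. It is an illustration under stated assumptions, not the authors' implementation: the function names (`floyd_warshall`, `next_keypoint`), the 4-node toy graph, and its edge costs are hypothetical, and in ProQ the edge costs would presumably come from the learned quasimetric rather than a hand-written matrix.

```python
import math

INF = math.inf


def floyd_warshall(cost):
    """All-pairs shortest paths on a directed graph.

    cost[i][j] is the (possibly asymmetric) cost of the edge i -> j,
    or math.inf when there is no direct edge.
    Returns (dist, nxt): dist[i][j] is the shortest path cost and
    nxt[i][j] is the first node to visit after i on that path
    (None if j is unreachable from i).
    """
    n = len(cost)
    dist = [row[:] for row in cost]
    nxt = [[j if cost[i][j] < INF else None for j in range(n)] for i in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        nxt[i][i] = i
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def next_keypoint(nxt, start, goal):
    """Constant-time lookup of the next sub-goal on the way to `goal`."""
    return nxt[start][goal]


# Hypothetical toy graph: a one-way chain 0 -> 1 -> 2 -> 3 plus a cheap
# shortcut 3 -> 0, so costs are asymmetric (0 -> 3 differs from 3 -> 0).
cost = [
    [0.0, 1.0, INF, INF],
    [INF, 0.0, 1.0, INF],
    [INF, INF, 0.0, 1.0],
    [0.5, INF, INF, 0.0],
]
dist, nxt = floyd_warshall(cost)
```

Once the tables are precomputed, selecting a sub-goal reduces to the lookup `next_keypoint(nxt, start, goal)`, which is what makes this kind of planning cheap at execution time; the short-horizon policy would then be asked to reach that keypoint.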
Abstract
Offline Goal-Conditioned Reinforcement Learning seeks to train agents to reach specified goals from previously collected trajectories. Scaling that promise to long-horizon tasks remains challenging, notably due to compounding value-estimation errors. Principled geometry offers a potential solution to these issues. Following this insight, we introduce Projective Quasimetric Planning (ProQ), a compositional framework that learns an asymmetric distance and then repurposes it, firstly as a repulsive energy forcing a sparse set of keypoints to spread uniformly over the learned latent space, and secondly as a structured directional cost guiding the agent towards proximal sub-goals. In particular, ProQ couples this geometry with a Lagrangian out-of-distribution detector to ensure the learned keypoints stay within reachable areas. By unifying metric learning, keypoint coverage, and goal-conditioned control, our approach produces meaningful sub-goals and robustly drives long-horizon goal-reaching on diverse navigation benchmarks.
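As a rough illustration of how an asymmetric distance can double as a repulsive energy, the sketch below defines a toy quasimetric and a Coulomb-like energy over a set of keypoints. Everything here is an assumption made for illustration: the hand-written `toy_quasimetric` (with hypothetical `up_cost` and `down_cost` parameters) stands in for the learned distance, the inverse-distance form of `repulsion_energy` is one common choice rather than the paper's stated objective, and the out-of-distribution barrier is omitted entirely.

```python
def toy_quasimetric(x, y, up_cost=2.0, down_cost=1.0):
    """Hypothetical asymmetric distance from x to y.

    Moving in the positive direction of a coordinate costs `up_cost` per
    unit and moving in the negative direction costs `down_cost` per unit,
    so d(x, y) != d(y, x) in general, while d(x, x) = 0 and the triangle
    inequality still holds.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        delta = yi - xi
        total += up_cost * delta if delta > 0 else down_cost * (-delta)
    return total


def repulsion_energy(keypoints, dist=toy_quasimetric, eps=1e-6):
    """Coulomb-like energy: sum of inverse distances over ordered pairs.

    Lower energy means the keypoints are more spread out under `dist`.
    `eps` only guards against division by zero for coincident points.
    """
    energy = 0.0
    for i, ki in enumerate(keypoints):
        for j, kj in enumerate(keypoints):
            if i != j:
                energy += 1.0 / (dist(ki, kj) + eps)
    return energy


# Two hypothetical keypoint layouts in a 2-D latent space.
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

energy_clustered = repulsion_energy(clustered)
energy_spread = repulsion_energy(spread)
```

In this toy setting the clustered layout has the higher energy, so minimizing the energy with respect to keypoint positions pushes them apart. Without an additional constraint nothing keeps them inside the region covered by the data, which is the role the abstract assigns to the out-of-distribution detector.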
