VLD: Visual Language Goal Distance for Reinforcement Learning Navigation

Lazar Milikic, Manthan Patel, Jonas Frey

TL;DR

This work tackles scalable, robust navigation by decoupling perception from control. It introduces Vision-Language Distance (VLD), a self-supervised, multimodal distance predictor trained on internet-scale video data, coupled with a simulation-trained RL policy that consumes this distance signal at deployment. A Gaussian mixture NLL objective provides calibrated distance and confidence estimates. Ordinal consistency, measured with Kendall's tau, serves as a principled evaluation of the distance function, and the policy transfers by substituting predicted distances for the privileged ones at run time. The approach supports image, text, or multimodal goals and demonstrates strong ordinal consistency and competitive navigation performance in simulation, with promising sim-to-real transfer characteristics because the policy relies only on a scalar distance signal. Overall, VLD offers a scalable path toward reliable, multimodal visual navigation by separating perception pretraining from policy learning and focusing RL on robust low-level control.
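
To make the Gaussian mixture NLL objective concrete, the following is a minimal sketch of the negative log-likelihood of a scalar distance under a one-dimensional Gaussian mixture. The summary does not state the number of components, the parameterization, or how the confidence score is derived, so the function name `gmm_nll` and its arguments are assumptions for illustration only.

```python
import math


def gmm_nll(target, weights, means, stds):
    """NLL of a scalar target under a 1-D Gaussian mixture.

    Hypothetical helper: weights must be positive and sum to 1,
    stds must be positive. Uses log-sum-exp for numerical stability.
    """
    log_terms = []
    for w, mu, sigma in zip(weights, means, stds):
        # log of the Gaussian density N(target; mu, sigma^2)
        log_density = (
            -0.5 * math.log(2.0 * math.pi)
            - math.log(sigma)
            - 0.5 * ((target - mu) / sigma) ** 2
        )
        log_terms.append(math.log(w) + log_density)
    m = max(log_terms)
    log_likelihood = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return -log_likelihood


# A mixture with a sharp mode near 2 and a broad mode near 10.
weights, means, stds = [0.7, 0.3], [2.0, 10.0], [0.5, 3.0]
print(gmm_nll(2.0, weights, means, stds))   # target at the sharp mode: low NLL
print(gmm_nll(30.0, weights, means, stds))  # target far from both modes: high NLL
```

A narrow, well-placed component is rewarded with a low loss, while a confident but wrong prediction is penalized heavily, which is why a likelihood objective of this kind can yield an uncertainty estimate alongside the distance.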

Abstract

Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information ("where to go") from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.
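
The noise injection described in the abstract can be illustrated with a small sketch. The abstract does not specify the noise model, so the version below assumes zero-mean Gaussian noise whose standard deviation grows with the true distance; the function `noisy_distance` and the parameters `rel_sigma` and `abs_sigma` are hypothetical and not taken from the paper.

```python
import random


def noisy_distance(true_distance, rng, rel_sigma=0.2, abs_sigma=0.05):
    """Perturb a privileged geometric distance to imitate a learned predictor.

    Assumed noise model (not from the paper): Gaussian noise with standard
    deviation abs_sigma + rel_sigma * true_distance, clipped at zero so the
    policy never observes a negative distance.
    """
    sigma = abs_sigma + rel_sigma * true_distance
    return max(0.0, true_distance + rng.gauss(0.0, sigma))


rng = random.Random(0)  # seeded for reproducibility
true_distances = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
observed = [noisy_distance(d, rng) for d in true_distances]
print(observed)  # what the policy would see in place of the exact distances
```

Training against perturbed distances of this kind is meant to keep the policy from over-trusting an exact signal that will not be available at deployment, where the VLD prediction replaces the simulator's ground truth.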

Paper Structure

This paper contains 55 sections, 11 equations, 19 figures, 12 tables, 2 algorithms.

Figures (19)

  • Figure 1: Overview of our framework. (A) Training stage: we separately train a temporal Vision–Language Distance (VLD) function on diverse real-world and synthetic video datasets, and an RL navigation policy in simulation using geometric distance-to-goal signals with injected noise to mimic real predictor uncertainty. (B) Deployment stage: the trained RL policy consumes predictions from the learned VLD model—specified by either image or text goals—to navigate in simulated and real-world environments.
  • Figure 2: Vision–Language Distance (VLD) architecture. Egocentric observations and a goal (image and/or text) are encoded into tokens using frozen backbones (DINOv2 for images, CLIP for text). Text tokens are projected into the DINOv2 embedding space so that both modalities lie in the same space as the observation tokens. A Transformer decoder attends from observation queries to goal tokens, and the CLS output is used by MLP heads to predict temporal distance and a confidence score.
  • Figure 3: Ordinal consistency analysis on image-goal examples. For each trajectory, models compute distances independently at every time step using the last frame as the goal. Top rows: normalized distance curves with associated Kendall’s $\tau$ values (left), VLD confidence evolution (middle), and the goal image or goal text (right). Bottom rows: the sequence of agent observations along the trajectory. Across both examples, VLD exhibits strong monotonic alignment with ground truth and meaningful confidence behavior, while baselines either drift or fail to reflect appearance-based changes as reliably.
  • Figure 4: Ordinal consistency analysis for text-goal Habitat example (VLD: text, image, multimodal). VLD computes distances independently at every time step using the last frame or text prompt as the goal. Top row: normalized distance predictions with corresponding Kendall’s $\tau$ scores (left), VLD confidence curve (middle), and the goal image (right). Bottom row: the sequence of observations along the trajectory. The automatically bootstrapped (Appendix \ref{appdx:data_collection}) text description (“plant on the wall above the bed in the room”) provides only partial and occasionally ambiguous guidance. Nonetheless, text-only VLD predictions follow the global decreasing trend and remain broadly aligned with the image-goal variants, albeit with higher noise, as expected for semantic-only supervision. As the agent approaches the room containing the goal, confidence increases and predicted distances fall smoothly, though not exactly to zero, an expected outcome for semantic (non-pixel-aligned) goal specifications. The multimodal version closely tracks the image-only-goal predictions, demonstrating that linguistic cues can reinforce, but do not override, visual distance estimation.
  • Figure 5: Negative mining teaches the model to separate unrelated scenes. Given two observations from completely different environments (left: kitchen; right: spaceship), the desired behavior is to output a distance close to $td_{\max}$, indicating that the images cannot correspond to nearby states on any trajectory. Only the model trained with negative mining learns this behavior, predicting a high distance with high confidence; the model trained without it collapses toward an uncertain mid-range prediction.
  • ...and 14 more figures
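
The ordinal-consistency evaluation referred to in Figures 3 and 4 scores how well predicted distances preserve the ordering of the true remaining distance along a trajectory. The following is a minimal sketch using the tau-a variant of Kendall's tau (tied pairs count as neither concordant nor discordant); which variant the paper uses is not stated here, and the function name and sample values are illustrative assumptions.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length sequences.

    Counts concordant minus discordant pairs over all n*(n-1)/2 pairs.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Ground truth: steps remaining until the goal frame (the last frame).
steps_to_goal = [4, 3, 2, 1, 0]
# Hypothetical normalized predictions with one out-of-order pair.
predicted = [0.9, 0.7, 0.75, 0.2, 0.05]
print(kendall_tau(steps_to_goal, predicted))  # 0.8
```

Because the score depends only on ordering, it is insensitive to the scale of the predictions, which suits a comparison between distance functions that are normalized differently.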