VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning

Yi Du, Taimeng Fu, Zhuoqun Chen, Bowen Li, Shaoshu Su, Zhipeng Zhao, Chen Wang

TL;DR

VL-Nav tackles vision-language navigation in unseen environments by fusing pixel-wise vision-language features with curiosity-driven exploration and spatial reasoning. The system combines rolling occupancy maps, frontier and instance-based targets, and a CVL scoring policy to select informative goals, then uses the FAR planner for real-time path planning on low-power hardware at 30 Hz. Key contributions include integrating instance-based target points, Gaussian-mixture VL scoring, and curiosity cues to achieve robust zero-shot VLN with high SR/SPL across indoor and outdoor settings, outperforming VLFM and classical baselines by substantial margins. The practical impact lies in enabling reliable, semantically guided robot navigation on resource-constrained platforms, with real-world demonstrations validating efficiency and scalability. Future work envisions handling more complex, multi-step instructions and tighter integration with large language models for broader open-vocabulary perception.

Abstract

Vision-language navigation in unknown environments is crucial for mobile robots. In scenarios such as household assistance and rescue, mobile robots need to understand a human command, such as "find a person wearing black". We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, our method integrates pixel-wise vision-language features with curiosity-driven exploration. This approach enables robust navigation to human-instructed instances across diverse environments. We deploy VL-Nav on a four-wheel mobile robot and evaluate its performance through comprehensive navigation tasks in both indoor and outdoor environments, spanning different scales and semantic complexities. Remarkably, VL-Nav operates at a real-time frequency of 30 Hz with a Jetson Orin NX, highlighting its ability to conduct efficient vision-language navigation. Results show that VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.

Paper Structure

This paper contains 38 sections, 8 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: We propose VL-Nav, a real-time zero-shot vision-language navigation approach with spatial reasoning that integrates pixel-wise vision-language features and curiosity-based exploration for mobile robots. (a) Hallway: The wheeled robot is tasked with "find a man in gray" in a hallway. Unlike the classical frontier-based method (red line) and VLFM (green line), VL-Nav (blue line) leverages pixel-wise vision-language (VL) features from the "gray cloth" cue for spatial reasoning, selecting the most VL-correlated goal point and successfully locating the missing person. The value map shows that the "gray cloth" VL cue prioritizes the right-side area, marked by yellow square points. (b) Apartment: The robot is tasked with "Go to the tall white trash bin." It detects two white trash bins of different sizes in the bottom camera observation, but assigns a higher confidence score to the taller bin (0.98) than to the shorter one (0.48). These pixel-wise VL features are incorporated into the spatial distribution to select the correct goal point, guiding the robot toward the taller bin.
  • Figure 2: An overview of the VL-Nav pipeline. VL-Nav processes inputs including prompts, RGB images, odometry poses, and LiDAR scans. The Vision-Language (VL) module conducts open-vocabulary pixel-wise detection to identify areas and objects related to the prompt, generating instance-based target points. Concurrently, the map module performs terrain analysis and manages a dynamic occupancy map. Frontier-based target points are then identified based on this occupancy map, along with the instance points, forming a candidate points pool. VL-Nav employs spatial reasoning to select the most effective goal point from this pool for path planning.
  • Figure 3: A brief illustration of VL scoring. The pixel-wise open-vocabulary detection results are transferred into a spatial distribution via a Gaussian mixture model regularized by the FOV weighting (the gray arrows in the figure). The frontier-based and instance-based target points are then each assigned a VL score from this distribution, according to the VL scoring equation.
  • Figure 4: Four different real-world experiment environments.
  • Figure 5: Top-down view of the trajectories comparison on the value maps with the detection results across the four different environments.
  • ...and 1 more figure
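
The VL scoring idea described for Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not reproduce its equation. The following are all assumptions made for illustration: the names `Detection`, `vl_score`, and `select_goal`; parametrizing detections by horizontal bearing; the value of `sigma`; the cosine-shaped FOV weight; and the 90-degree field of view. The curiosity cues mentioned in the TL;DR are omitted. The sketch only shows how detection confidences could be spread into a Gaussian mixture over viewing angle, damped toward the edge of the field of view, and used to rank a pool of candidate points.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """One open-vocabulary detection (hypothetical structure)."""
    bearing: float     # angle relative to the camera axis, radians (positive = left)
    confidence: float  # detection score in [0, 1]


def wrap_angle(a: float) -> float:
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def vl_score(candidate, robot_xy, robot_yaw, detections,
             sigma=0.2, fov=math.radians(90)):
    """Score a candidate (x, y) point against the current detections.

    Assumed form: a confidence-weighted Gaussian mixture over bearing,
    multiplied by a cosine FOV weight that is 1 on the camera axis and
    falls to 0 at the edge of the field of view.
    """
    dx = candidate[0] - robot_xy[0]
    dy = candidate[1] - robot_xy[1]
    rel = wrap_angle(math.atan2(dy, dx) - robot_yaw)
    half_fov = fov / 2.0
    if abs(rel) > half_fov:
        return 0.0  # candidate is outside the camera's field of view
    mixture = sum(
        d.confidence * math.exp(-0.5 * ((rel - d.bearing) / sigma) ** 2)
        for d in detections
    )
    fov_weight = math.cos((rel / half_fov) * (math.pi / 2.0))
    return mixture * fov_weight


def select_goal(candidates, robot_xy, robot_yaw, detections):
    """Pick the candidate (frontier or instance point) with the highest score."""
    return max(
        candidates,
        key=lambda c: vl_score(c, robot_xy, robot_yaw, detections),
    )
```

As a usage illustration mirroring the trash-bin example in Figure 1(b), two detections with confidences 0.98 and 0.48 at opposite bearings would steer `select_goal` toward the candidate point lying in the direction of the higher-confidence detection, while a candidate behind the robot scores zero under the assumed FOV cutoff.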