VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning
Yi Du, Taimeng Fu, Zhuoqun Chen, Bowen Li, Shaoshu Su, Zhipeng Zhao, Chen Wang
TL;DR
VL-Nav tackles vision-language navigation in unseen environments by fusing pixel-wise vision-language features with curiosity-driven exploration and spatial reasoning. The system combines rolling occupancy maps, frontier-based and instance-based target points, and a curiosity-driven vision-language (CVL) scoring policy to select informative goals, then hands the selected goal to the FAR planner for real-time path planning. The full system runs at 30 Hz on low-power hardware. Key contributions include instance-based target points, Gaussian-mixture vision-language scoring, and curiosity cues, which together yield robust zero-shot navigation with high success rate (SR) and success weighted by path length (SPL) across indoor and outdoor settings, outperforming VLFM and classical baselines by substantial margins. In practice, this enables reliable, semantically guided robot navigation on resource-constrained platforms, and real-world demonstrations validate its efficiency and scalability. Future work targets more complex, multi-step instructions and tighter integration with large language models for broader open-vocabulary perception.
Abstract
Vision-language navigation in unknown environments is crucial for mobile robots. In scenarios such as household assistance and rescue, mobile robots need to understand a human command, such as "find a person wearing black". We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, our method integrates pixel-wise vision-language features with curiosity-driven exploration. This approach enables robust navigation to human-instructed instances across diverse environments. We deploy VL-Nav on a four-wheel mobile robot and evaluate it on comprehensive navigation tasks in both indoor and outdoor environments, spanning different scales and semantic complexities. Remarkably, VL-Nav runs in real time at 30 Hz on a Jetson Orin NX, demonstrating efficient vision-language navigation on low-power hardware. Results show that VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
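
To make the goal-selection idea concrete, the following is a minimal Python sketch of how candidate targets might be ranked by combining a vision-language relevance score with a curiosity cue. It is an illustration under stated assumptions, not the authors' implementation: the `Candidate` class, the `curiosity` function, the Gaussian falloff width `sigma`, and the weight `w_vl` are all hypothetical, and the per-candidate `vl_score` is assumed to come from some upstream pixel-wise vision-language model.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical goal candidate on the 2D map."""
    x: float
    y: float
    kind: str        # "frontier" or "instance"
    vl_score: float  # assumed language relevance in [0, 1], computed upstream


def curiosity(c, robot_xy, visited, sigma=2.0):
    """Hypothetical curiosity cue in [0, 1].

    Rewards candidates that are close to the robot (cheap to reach) and
    far from already-visited positions (likely to reveal new space).
    """
    dist = math.hypot(c.x - robot_xy[0], c.y - robot_xy[1])
    nearness = math.exp(-dist ** 2 / (2 * sigma ** 2))
    if visited:
        d_visited = min(math.hypot(c.x - vx, c.y - vy) for vx, vy in visited)
        novelty = 1.0 - math.exp(-d_visited ** 2 / (2 * sigma ** 2))
    else:
        novelty = 1.0
    return 0.5 * nearness + 0.5 * novelty


def select_goal(candidates, robot_xy, visited, w_vl=0.7):
    """Return the candidate with the highest combined score, or None."""
    if not candidates:
        return None

    def score(c):
        # Assumed linear blend of semantic relevance and curiosity.
        return w_vl * c.vl_score + (1.0 - w_vl) * curiosity(c, robot_xy, visited)

    return max(candidates, key=score)


# Usage: one weakly relevant frontier and one strongly relevant instance.
candidates = [
    Candidate(1.0, 0.0, "frontier", vl_score=0.1),
    Candidate(2.0, 0.0, "instance", vl_score=0.9),
]
goal = select_goal(candidates, robot_xy=(0.0, 0.0), visited=[(0.0, 0.0)])
print(goal.kind)
```

In a full system the selected goal would then be passed to a path planner (FAR, per the summary above). The linear blend here is only one plausible way to combine the two cues.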
