End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering
Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio
TL;DR
VLMnav shows that an off-the-shelf Vision-Language Model (VLM) can act as a zero-shot, end-to-end navigation policy when navigation is reframed as a question-answering task over a discretized action space. The pipeline combines four components: navigability estimation from depth, an exploration-biased action proposer, projection of the candidate actions onto the image, and a prompt that asks the VLM for a single-step action choice. A separate prompt decides when to terminate. On the ObjectNav and GOAT benchmarks, it outperforms prior prompting baselines such as PIVOT, although it still trails specialized systems in some scenarios. The design analysis shows that performance is sensitive to field of view and to depth-perception quality. The results suggest that VLMs can generalize across navigation goals with minimal task-specific data, which could make navigation systems simpler and more adaptable as VLMs improve.
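To make the question-answering framing concrete, the following is a minimal Python sketch of how depth-derived free space might be turned into a numbered action menu and how a VLM reply might be mapped back to an action. It is an illustration under stated assumptions, not the authors' implementation: the `Ray` type, the function names, the thresholds (`min_free`, `min_sep`, `max_step`), the greedy "unexplored first" selection rule, and the prompt wording are all hypothetical. Image annotation and the actual VLM call are omitted.

```python
import math
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Ray:
    """Hypothetical summary of one heading, assumed to come from a depth image."""
    theta: float     # heading relative to the robot, in radians
    free: float      # free distance along that heading, in metres
    explored: bool   # whether this direction has already been visited


def propose_actions(
    rays: List[Ray],
    min_free: float = 0.5,                  # assumed clearance threshold
    min_sep: float = math.radians(20),      # assumed angular spacing
    max_step: float = 1.5,                  # assumed cap on step length
) -> Dict[int, Tuple[float, float]]:
    """Return {label: (heading, step length)} for a sparse set of actions."""
    # Drop blocked headings, then rank unexplored headings first
    # (False sorts before True) and longer free distances before shorter.
    ordered = sorted(
        (r for r in rays if r.free >= min_free),
        key=lambda r: (r.explored, -r.free),
    )
    # Greedily keep headings that are far enough from those already kept.
    chosen: List[Ray] = []
    for r in ordered:
        if all(abs(r.theta - c.theta) >= min_sep for c in chosen):
            chosen.append(r)
    # Label actions by increasing heading so labels are stable in the image.
    chosen.sort(key=lambda r: r.theta)
    return {i + 1: (r.theta, min(r.free, max_step)) for i, r in enumerate(chosen)}


def build_prompt(goal: str, actions: Dict[int, Tuple[float, float]]) -> str:
    """Phrase the navigation step as a multiple-choice question (wording assumed)."""
    labels = ", ".join(str(k) for k in actions)
    return (
        f"You are navigating to find: {goal}. "
        f"The image shows numbered arrows for the possible moves: {labels}. "
        "Reply with the number of the single best move."
    )


def parse_choice(reply: str, actions: Dict[int, Tuple[float, float]]) -> Optional[int]:
    """Return the first integer in the reply if it names a valid action, else None."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    label = int(match.group())
    return label if label in actions else None


rays = [
    Ray(theta=-0.6, free=2.0, explored=True),
    Ray(theta=-0.5, free=2.0, explored=False),
    Ray(theta=0.0, free=0.3, explored=False),   # blocked: below min_free
    Ray(theta=0.5, free=1.0, explored=False),
]
actions = propose_actions(rays)
prompt = build_prompt("couch", actions)
```

In this sketch the explored heading at -0.6 rad is dropped because it lies within the spacing threshold of an unexplored neighbour, which is one simple way to bias the menu toward unexplored space. A reply that names no valid label yields `None`, so the caller can re-prompt or fall back to a default action.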
Abstract
We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
