End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio

TL;DR

VLMnav demonstrates that an off-the-shelf Vision-Language Model can serve as a zero-shot, end-to-end navigation policy by reframing navigation as a question-answering task grounded in a discretized action space. The architecture combines depth-informed navigability, an exploration-biased action proposer, visual action projection, and a prompting strategy that yields a one-step action decision, with a separate termination prompt. On ObjectNav and GOAT benchmarks, it outperforms prior prompting baselines like PIVOT and reveals design sensitivities to field-of-view and depth-perception quality, while still trailing specialized systems in some scenarios. This work suggests that leveraging VLMs for embodied tasks can generalize across navigation goals with minimal task-specific data, paving the way for simpler, more adaptable navigation systems as VLMs mature.

Abstract

We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
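To make the question-answering framing concrete, the following is a minimal sketch of one decision step: candidate moves are numbered, described to the VLM in a text prompt, and the chosen number is parsed back out of the free-form reply. The names `Action`, `build_action_prompt`, and `parse_action`, the prompt wording, and the `{'action': <number>}` reply format are all assumptions made for illustration; they are not taken from the VLMnav code, and the call to the VLM itself is left out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical candidate move; the fields are illustrative assumptions."""
    key: int      # number drawn next to the arrow on the image
    theta: float  # heading change in radians
    r: float      # forward distance in meters


def build_action_prompt(goal: str, actions: List[Action]) -> str:
    """Phrase the navigation step as a question over the numbered actions."""
    keys = ", ".join(str(a.key) for a in actions)
    return (
        f"TASK: navigate to the nearest {goal}. "
        f"The image shows {len(actions)} numbered arrows ({keys}), "
        "each a possible move. Reason about which arrow best serves the task, "
        "then answer with {'action': <number>}."
    )


def parse_action(reply: str, actions: List[Action]) -> Optional[Action]:
    """Extract the chosen action key from the reply; None if absent or invalid."""
    match = re.search(r"""['"]action['"]\s*:\s*(\d+)""", reply)
    if match is None:
        return None
    key = int(match.group(1))
    return next((a for a in actions if a.key == key), None)
```

In this sketch an unparseable or out-of-range answer returns `None`, so the caller can decide whether to re-prompt or fall back to a default move.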

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The full action prompt for VLMnav consists of three parts: a system prompt describing the embodiment; an action prompt describing the task, the potential actions, and the output instruction; and an image prompt showing the current observation along with the annotated actions
  • Figure 2: Approach: Our method is made up of four key components: (i) Navigability, which determines locations the agent can actually move to, and updates the voxel map accordingly. An example update step to the map shows the marking of new area as explored (gray) or unexplored (green). (ii) Action Proposer, which refines a set of final actions according to spacing and exploration. (iii) Projection, which visually annotates the image with actions. (iv) Prompting, which constructs a detailed chain-of-thought prompt to select an action.
  • Figure 3: An example step of the Navigability subroutine. The navigability mask is shown in blue and polar actions making up $A_\text{initial}$ are in green
  • Figure 4: The separate prompt for determining episode termination
  • Figure 5: Baselines: Comparing the four different methods on a sample image. Ours contains arrows that point to navigable locations, PIVOT has arrows sampled from a random 2-D Gaussian, Ours w/o nav sees uniformly spaced arrows (note arrows 3 and 5 point into a wall), and Prompt Only sees just the raw RGB image
  • ...and 1 more figure
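The Action Proposer step named in the Figure 2 caption (refining candidate actions "according to spacing and exploration") can be illustrated with a short sketch. This is a simplified greedy filter written for illustration, not the authors' algorithm: the function name `propose_actions`, the `(theta, r, explored)` tuple layout, the 20-degree minimum spacing, and the 0.5 down-weighting of explored directions are all assumed values.

```python
import math


def propose_actions(candidates, min_spacing=math.radians(20), explored_penalty=0.5):
    """Greedy sketch: keep well-spaced polar actions, favoring unexplored ones.

    candidates: list of (theta, r, explored) tuples, where theta is the heading
    in radians, r the navigable distance in meters, and explored a bool saying
    whether that direction leads into already-explored area (assumed inputs).
    Returns a list of (key, theta, r) with keys numbered left to right from 1.
    """
    def score(candidate):
        _, r, explored = candidate
        # Longer moves score higher; explored directions are down-weighted.
        return r * (explored_penalty if explored else 1.0)

    chosen = []
    for cand in sorted(candidates, key=score, reverse=True):
        # Accept a candidate only if it is far enough from every kept action.
        if all(abs(cand[0] - kept[0]) >= min_spacing for kept in chosen):
            chosen.append(cand)

    chosen.sort(key=lambda c: c[0])
    return [(i + 1, theta, r) for i, (theta, r, _) in enumerate(chosen)]
```

The returned keys are what a projection step would draw onto the image as numbered arrows, so the spacing constraint also keeps the annotations from overlapping visually.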