Think, Remember, Navigate: Zero-Shot Object-Goal Navigation with VLM-Powered Reasoning

Mobin Habibpour, Fatemeh Afghah

TL;DR

This work tackles zero-shot Object-Goal Navigation by elevating a Vision-Language Model (VLM) to the role of primary planner. It presents a VLM-powered exploration framework that combines structured Chain-of-Thought (CoT) reasoning, dynamic prompts that include recent action history, and multimodal inputs, including a top-down obstacle map, to steer frontier-based exploration. Evaluations on HM3D, Gibson, and MP3D show improved trajectory directness and navigation efficiency, and ablations confirm the importance of CoT and memory. The results demonstrate the potential of VLMs as embodied planners for robotics; computational cost and reliance on manually crafted prompts remain areas for future improvement.

Abstract

While Vision-Language Models (VLMs) are set to transform robotic navigation, existing methods often underutilize their reasoning capabilities. To unlock the full potential of VLMs in robotics, we shift their role from passive observers to active strategists in the navigation process. Our framework outsources high-level planning to a VLM, which leverages its contextual understanding to guide a frontier-based exploration agent. This intelligent guidance is achieved through a trio of techniques: structured chain-of-thought prompting that elicits logical, step-by-step reasoning; dynamic inclusion of the agent's recent action history to prevent getting stuck in loops; and a novel capability that enables the VLM to interpret top-down obstacle maps alongside first-person views, thereby enhancing spatial awareness. When tested on challenging benchmarks like HM3D, Gibson, and MP3D, this method produces exceptionally direct and logical trajectories, marking a substantial improvement in navigation efficiency over existing approaches and charting a path toward more capable embodied agents.
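To make the first two techniques concrete, the following is a minimal sketch of how a structured chain-of-thought prompt with a rolling action history might be assembled. It is an illustration under stated assumptions, not the authors' prompt: the class name `PromptBuilder`, the action names, the history length, and the wording of the reasoning steps are all hypothetical.

```python
from collections import deque

# Hypothetical discrete action set; the paper's exact action space may differ.
ACTIONS = ("forward", "backward", "left", "right")


class PromptBuilder:
    """Builds a step-by-step prompt that carries the agent's recent actions."""

    def __init__(self, history_len=5):
        # A deque with maxlen drops the oldest action once it is full,
        # so the prompt only ever shows the most recent steps.
        self.history = deque(maxlen=history_len)

    def record(self, action):
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.history.append(action)

    def build(self, target):
        recent = ", ".join(self.history) if self.history else "none"
        return "\n".join([
            f"You are guiding a robot that is searching for a {target}.",
            "You are given a first-person image and a top-down obstacle map.",
            f"Recent actions (oldest first): {recent}.",
            "Reason step by step:",
            "1. Describe the room and the objects you can see.",
            f"2. Judge how likely a {target} is to be nearby.",
            "3. Check the map for open, unexplored directions.",
            "4. Avoid repeating a back-and-forth pattern from the recent actions.",
            "Finally, give a score between 0 and 1 for each action: "
            + ", ".join(ACTIONS) + ".",
        ])


builder = PromptBuilder(history_len=3)
for step in ["forward", "left", "right", "forward"]:
    builder.record(step)
print(builder.build("tv"))
```

Because the history is bounded, the prompt stays short while still exposing the pattern the model needs to notice in order to break an oscillation.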

Paper Structure

This paper contains 13 sections, 2 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Our System Pipeline. (1) Sensor data is used to create an Obstacle Map. (2) Geometric frontiers are detected on this map. (3) The LLaVA-1.6 VLM analyzes the agent's egocentric view, the map, and a dynamic prompt that includes action history. (4) The VLM produces semantic scores, which are then used to build a Value Map that indicates the relevance of different areas. (5) The Frontier and Value Maps are combined to prioritize waypoints, directing the agent toward the most promising regions.
  • Figure 2: Conceptual illustration of the value map components. (a) Robot's field of view (FOV) and the associated viewing uncertainty cone. (b) Example action space visualization with VLM-assigned scores: [Forward: 0.9, Backward: 0, Right: 0, Left: 0.1].
  • Figure 3: Dual visual inputs for the VLM: (a) the top-down map shows the spatial layout with obstacles (in gray) and the agent's heading (arrow), while (b) the egocentric view provides a first-person perspective. This combination improves the VLM's spatial understanding.
  • Figure 4: Example of a potential decision loop. Without action history, oscillating evaluations between points (a) and (b) could cause stagnation. Tracking history helps break such cycles.
  • Figure 5: Qualitative analysis of navigation with and without our full CoT framework. The agent's view and the VLM's reasoning are shown at various timesteps. The top row (No CoT) displays an agent with basic reasoning that wanders aimlessly and fails to locate the target. The bottom row (Full CoT) illustrates how a structured, step-by-step reasoning process (such as identifying a bathroom, realizing a TV is not there, and choosing to leave) results in a more intelligent exploration strategy and a direct, successful path. This comparison underscores the crucial role of CoT in achieving more effective and intelligent navigation.
  • ...and 1 more figure
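Steps (2) and (5) of the pipeline in Figure 1 can be sketched in a few lines. The example below is a simplified illustration, not the paper's implementation: it assumes a small occupancy grid, treats a frontier as any free cell with an unknown 4-neighbor, and takes the value map as already computed from the VLM's scores. The cell encoding, the function names, and the tie-breaking rule (prefer the closer frontier) are assumptions made for this sketch.

```python
# Assumed cell encoding for the obstacle map (not taken from the paper).
FREE, OBSTACLE, UNKNOWN = 0, 1, -1


def find_frontiers(grid):
    """Return free cells that touch at least one unknown 4-neighbor."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            neighbors = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if any(
                0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN
                for nr, nc in neighbors
            ):
                frontiers.append((r, c))
    return frontiers


def pick_waypoint(grid, value_map, agent):
    """Pick the frontier with the highest value; break ties by distance."""
    frontiers = find_frontiers(grid)
    if not frontiers:
        return None  # nothing left to explore

    def rank(cell):
        r, c = cell
        distance = abs(r - agent[0]) + abs(c - agent[1])  # Manhattan distance
        return (-value_map[r][c], distance)

    return min(frontiers, key=rank)


grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OBSTACLE, UNKNOWN],
    [FREE, FREE, FREE],
]
# Hypothetical semantic values, e.g. derived from VLM scores.
value_map = [
    [0.1, 0.2, 0.0],
    [0.1, 0.0, 0.0],
    [0.3, 0.5, 0.9],
]
print(find_frontiers(grid))                    # [(0, 1), (2, 2)]
print(pick_waypoint(grid, value_map, (1, 0)))  # (2, 2)
```

Here the geometric step proposes two candidate frontiers, and the semantic value decides between them, so the agent heads for the farther cell because it is rated more promising.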