ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Teng Wang, Xinxin Zhao, Wenzhe Cai, Changyin Sun

TL;DR

ImagineNav++ introduces a mapless, open-vocabulary navigation framework that leverages scene imagination and a memory-augmented Vision-Language Model to select informative future viewpoints as subgoals. The Where2Imagine module generates plausible future observations, which are evaluated by a VLM guided by a selective foveation memory and a GPT-4o-mini planner to drive long-horizon exploration without task-specific training. Key contributions include the Where2Imagine imager, a hierarchical memory built on DINOv2 embeddings, and a diffusion-based novel view synthesis pipeline that grounds VLM reasoning in spatial structure. Experimental results on ObjectNav and InsINav demonstrate state-of-the-art performance in mapless settings, with strong robustness and efficiency, underscoring the value of scene imagination and memory in VLM-based embodied navigation.

Abstract

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry--critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves SOTA performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.
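The imagine, select, and navigate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every callable passed to `navigate` (`observe`, `propose_waypoint`, `synthesize_view`, `select_best`, `goal_visible`, `point_nav`) is a hypothetical stand-in for a component the abstract names only at a high level, and the memory here is a plain list rather than the selective foveation memory.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical types: a view is whatever the sensor returns; a waypoint is
# (distance, heading) relative to the agent. Neither comes from the paper.
View = Any
Waypoint = Tuple[float, float]


def navigate(
    observe: Callable[[], Sequence[View]],
    propose_waypoint: Callable[[View], Waypoint],
    synthesize_view: Callable[[View, Waypoint], View],
    select_best: Callable[[str, List[View], List[View]], int],
    goal_visible: Callable[[Sequence[View]], bool],
    point_nav: Callable[[Waypoint], None],
    goal: str,
    max_steps: int = 10,
) -> bool:
    """Run the loop until the goal is seen or the step budget runs out."""
    memory: List[View] = []  # simplified stand-in for the keyframe memory
    for _ in range(max_steps):
        views = observe()  # current observations, one per direction
        if goal_visible(views):
            return True

        # Imagine one future observation per candidate direction.
        waypoints = [propose_waypoint(v) for v in views]
        imagined = [synthesize_view(v, wp) for v, wp in zip(views, waypoints)]

        # Planning reduces to picking the best imagined image (the VLM's job).
        best = select_best(goal, imagined, memory)

        memory.append(views[best])   # remember what was actually observed
        point_nav(waypoints[best])   # reach the subgoal with a point-goal policy
    return False
```

Passing the components in as callables keeps the sketch independent of any particular VLM, view-synthesis model, or simulator; in a real system each one would wrap the corresponding model or policy.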

Paper Structure

This paper contains 26 sections, 8 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Comparison between the conventional LLM-based navigation pipeline and our ImagineNav++ pipeline. The traditional LLM-based navigation framework, illustrated on the left, relies on intricate sensor data processing and pose estimation for map creation, followed by LLM-driven reasoning to decide the exploration direction. In contrast, ImagineNav++ directly decomposes the long-horizon object-goal navigation task into a sequence of best-view image selection tasks for a VLM, avoiding the latency and compounding errors of traditional cascaded methods.
  • Figure 2: The overall pipeline of our mapless, open-vocabulary navigation framework ImagineNav++. At each iteration, the agent captures a panoramic view of its surroundings. The Imagination Module then leverages the trained Where2Imagine module, coupled with a novel view synthesis model, to generate novel scene views. Guided by structured prompts, the VLM performs target-oriented inference by integrating the historical selective foveation memory with imagined future waypoint observations. Subsequently, the system executes the PointNav policy to determine the next navigational action. This imagination, reasoning, and planning procedure iterates until the target is reached.
  • Figure 3: Visualization of human demonstration trajectories on the MP3D dataset in the Habitat-Web project. The trajectories reveal a consistent tendency for humans to prioritize directions toward semantically meaningful cues (e.g., doors) that structurally facilitate efficient exploration.
  • Figure 4: Visualization of the synthesized image observations at future navigation waypoints predicted by the imagination module. The synthesized images of future waypoints generated by the imagination module exhibit significant semantic variations, in contrast to the consistent semantics of current observations. This semantic diversity demonstrates the module's effectiveness in enhancing VLM decision-making.
  • Figure 5: Where2Imagine: the impact of the sampling interval $\mathbf{T}$ on ObjectNav performance on the HM3D dataset.
  • ...and 5 more figures