ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Xinxin Zhao, Wenzhe Cai, Likun Tang, Teng Wang

TL;DR

ImagineNav presents a mapless, open-vocabulary visual navigation framework that leverages vision-language model imagination to transform long-horizon planning into a sequence of best-view image selections. By coupling Where2Imagine with novel-view synthesis and a VLM-based high-level planner, the method grounds decision-making in imagined future observations and executes with a PointNav controller augmented by VER. Extensive experiments on HM3D and HSSD demonstrate strong performance gains over baselines, with ablations confirming the critical roles of imagination, viewpoint synthesis, and VLM reasoning. This approach reduces dependence on explicit 3D maps and sensor-heavy pipelines, offering robust zero-shot generalization to novel objects and scenes in real-world robotics settings.

Abstract

Visual navigation is an essential skill for home-assistance robots, providing the object-searching ability needed to accomplish long-horizon daily tasks. Many recent approaches use Large Language Models (LLMs) for commonsense inference to improve exploration efficiency. However, LLM planning is confined to text, and it is difficult to represent spatial occupancy and geometric layout with text alone, even though both are important for making rational navigation decisions. In this work, we seek to unleash the spatial perception and planning ability of Vision-Language Models (VLMs), and explore whether a VLM, given only the RGB/RGB-D stream captured by the on-board camera, can efficiently complete visual navigation tasks in a mapless manner. We achieve this by developing the imagination-powered navigation framework ImagineNav, which imagines the future observation images at valuable robot views and translates the complex navigation planning process into a much simpler best-view image selection problem for the VLM. To generate appropriate candidate robot views for imagination, we introduce the Where2Imagine module, which is distilled to align with human navigation habits. Finally, to reach the views preferred by the VLM, an off-the-shelf point-goal navigation policy is used. Empirical experiments on challenging open-vocabulary object navigation benchmarks demonstrate the superiority of our proposed system.
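The planning step described in the abstract (propose candidate views, imagine the observation at each, let the VLM pick the best one, then hand the chosen waypoint to a point-goal policy) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `Waypoint` representation, and the toy stand-ins for Where2Imagine, the novel-view synthesis model, and the VLM are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: relative (forward, left) offset in metres.
Waypoint = Tuple[float, float]


def choose_next_waypoint(
    goal: str,
    views: Sequence[str],
    propose_waypoint: Callable[[str], Waypoint],
    imagine_view: Callable[[str, Waypoint], str],
    pick_best_view: Callable[[str, List[str]], int],
) -> Waypoint:
    """One planning iteration: imagine a future view per current view,
    then let the selector choose which imagined view to move toward."""
    waypoints = [propose_waypoint(v) for v in views]
    imagined = [imagine_view(v, w) for v, w in zip(views, waypoints)]
    best = pick_best_view(goal, imagined)
    if not 0 <= best < len(waypoints):
        raise ValueError(f"selector returned invalid index {best}")
    return waypoints[best]


# --- Toy stand-ins so the sketch runs; none of these are the paper's models ---

def toy_propose(view: str) -> Waypoint:
    # Stand-in for a waypoint predictor: step 2 m forward, offset by heading.
    return (2.0, {"left": 1.0, "front": 0.0, "right": -1.0}[view])


def toy_imagine(view: str, waypoint: Waypoint) -> str:
    # Stand-in for novel-view synthesis: a text description instead of an image.
    scenes = {
        "left": "kitchen with fridge",
        "front": "hallway",
        "right": "living room with couch",
    }
    return scenes[view]


def toy_pick(goal: str, imagined: List[str]) -> int:
    # Stand-in for the VLM: first imagined view that mentions the goal, else 0.
    for i, text in enumerate(imagined):
        if goal in text:
            return i
    return 0


next_goal = choose_next_waypoint(
    "couch", ["left", "front", "right"], toy_propose, toy_imagine, toy_pick
)
print(next_goal)  # (2.0, -1.0)
```

In a full system the returned waypoint would be passed to the point-goal navigation policy, and the step would repeat until the target is reached; that outer loop is omitted here.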

Paper Structure

This paper contains 23 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The comparison between the conventional LLM-based navigation pipeline and our ImagineNav pipeline. The traditional LLM-based navigation framework, illustrated on the left, relies on intricate sensor data processing and pose estimation for map creation, followed by LLM-driven reasoning to decide the exploration direction. Instead, our ImagineNav directly translates the long-horizon object goal navigation task into a sequence of best-view image selection tasks for VLM, which avoids the latency and compounding error in the traditional cascaded methods.
  • Figure 2: The overall pipeline of our mapless, open-vocabulary navigation framework. At each iteration, the agent captures a panoramic view of its surroundings. In the Imagination Module, the trained Where2Imagine module is coupled with a novel view synthesis model to generate novel scene views. Guided by prompt templates, the VLM performs target-oriented inference. Subsequently, the system executes the PointNav policy to determine the next navigational action. This imagination, reasoning, and planning procedure iterates until the target is reached.
  • Figure 3: An example of the VLM analysis. By examining different future-view scenarios, the VLM pinpoints the direction most likely to contain the target object, a couch.
  • Figure 4: Visualization of the synthesized image observations at future navigation waypoints predicted by the imagination module. The imagined views differ drastically in their semantics, whereas the semantic information is relatively consistent across the current observations. This variation across future views highlights the advantage of the imagination module in enhancing the VLM's decision-making capabilities.
  • Figure 5: Visualization of the navigation trajectory. The top and bottom rows respectively show the complete top-down trajectories of successful and unsuccessful examples.
  • ...and 4 more figures