WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen

TL;DR

WMNav presents a fully modular world-model framework that embeds Vision-Language Models into predictive planning for zero-shot Object Goal Navigation. It introduces an online Curiosity Value Map to store predicted environmental states and a subtask decomposition to provide denser, reward-like feedback for prompts. A two-stage action proposer steers exploration and precise localization using ReasonVLM and PlanVLM without task-specific fine-tuning, yielding state-of-the-art results on HM3D and MP3D. This approach demonstrates the practical potential of VLM-driven world models to improve planning efficiency and robustness in complex indoor environments.

Abstract

Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent Vision-Language Model (VLM)-based agents have demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by VLMs. It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav maintains a Curiosity Value Map online as part of the world model memory, which provides dynamic configuration for the navigation policy. By decomposing the task according to a human-like thinking process, WMNav alleviates the impact of model hallucination: decisions are based on the feedback difference between the world model plan and the observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D shows that WMNav surpasses existing zero-shot methods in both success rate (SR) and exploration efficiency, measured by SPL (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.
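
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how SR and SPL (Success weighted by Path Length) are commonly computed in embodied navigation. It follows the standard definition of SPL; the function names and the toy episode values are illustrative assumptions and are not taken from the WMNav code base.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length to the goal and p is the length of the path the
    agent actually took. Failed episodes contribute 0.
    Assumes every shortest-path length is strictly positive.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        if s:
            total += l / max(p, l)
    return total / len(successes)


# Toy episodes (hypothetical values, for illustration only).
successes = [True, False, True]
shortest = [5.0, 4.0, 10.0]
taken = [10.0, 8.0, 10.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest, taken)
```

Because SPL multiplies success by a path-efficiency ratio, it can never exceed SR, which is why the two numbers are reported together.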

Paper Structure

This paper contains 22 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: World Model Navigation with VLM. In object navigation, our model first estimates the goal's presence likelihood in each scene of the panoramic image (e.g., a bed typically resides in a bedroom, which is likely situated at the corridor's terminus), then plans intermediate subtasks and chooses the most appropriate action to execute.
  • Figure 2: The WMNav framework. After acquiring the RGB-D panoramic image and pose information at step $t$, PredictVLM first predicts the state of the world, which is merged with the previous step's curiosity value map $s_{t-1}$ to obtain the current curiosity value map $s_t$. The updated map then projects the score of each direction back onto the panoramic image, and the direction with the highest score is selected. Next, given the image of the selected direction, PlanVLM determines the new subtask and the goal flag, which are stored in memory as the cost $c_t$; the memory $h_t$ combines $s_t$ and $c_t$. Finally, the two-stage action proposer annotates the action sequence on the selected image and sends it to ReasonVLM to obtain the final polar-coordinate action vector $a_t$ for execution. Note that PlanVLM and ReasonVLM are configured by the cost $c_{t-1}$.
  • Figure 3: Predict the Likelihood. (a) The world model predicts the Curiosity Value for each direction in the panoramic image based on the likelihood of the goal's presence. (b) The mutual projection of the navigable area between the ego-centric and top-down view perspectives. (c) Curiosity Value Map construction: The predicted scores from the world model are projected onto the top-down map and then fused with the previous step's Curiosity Value Map.
  • Figure 4: Plan the Route. The text prompt is configured by the previous step's subtask, the explanation for selecting the highest-scoring image, and the goal. Using the image with the highest curiosity value as the image prompt, the VLM is invoked to plan the agent's new subtask and detect the goal.
  • Figure 5: Reason the Action. In the exploration stage, the agent uses the action proposer to filter sampled actions. ActionVLM (obtained by configuring ReasonVLM) selects the most appropriate action for execution from the image annotated with a candidate action sequence, and this continues until the target is found and the agent shifts to the next stage, the goal-approaching stage. There, the agent uses the Goal Proposer to densely sample actions from the image, and GoalVLM (also obtained by configuring ReasonVLM) selects the action that best represents the goal location.
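
The Curiosity Value Map update described in the captions for Figures 2 and 3 (project per-direction scores onto a top-down map, then fuse with the previous map) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. The captions do not state the fusion rule, so the element-wise minimum used here is an assumption. The grid layout, the angular-sector projection, and the names `project_scores` and `fuse_curiosity_map` are also hypothetical.

```python
import numpy as np


def project_scores(shape, agent_rc, headings_deg, scores,
                   fov_deg=60.0, max_range=5.0):
    """Write each direction's score into the map cells it covers.

    Each direction is modelled as an angular sector centred on the agent
    (an assumption for illustration). Cells not covered stay NaN.
    """
    rows, cols = np.indices(shape)
    dy = rows - agent_rc[0]
    dx = cols - agent_rc[1]
    dist = np.hypot(dx, dy)
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0

    projected = np.full(shape, np.nan)
    for heading, score in zip(headings_deg, scores):
        # Smallest absolute angular difference, in [0, 180].
        diff = np.abs((angle - heading + 180.0) % 360.0 - 180.0)
        mask = (diff <= fov_deg / 2.0) & (dist <= max_range) & (dist > 0)
        projected[mask] = score
    return projected


def fuse_curiosity_map(prev_map, projected):
    """Fuse new scores into the previous map (assumed rule: element-wise min).

    Cells that were not observed this step keep their previous value.
    """
    fused = prev_map.copy()
    seen = ~np.isnan(projected)
    fused[seen] = np.minimum(prev_map[seen], projected[seen])
    return fused


# Toy example: 21x21 map initialised to a high curiosity value of 10.
s_prev = np.full((21, 21), 10.0)
agent = (10, 10)

# Step t: two directions scored 8 and 2 (hypothetical VLM outputs).
proj_t = project_scores(s_prev.shape, agent, [0.0, 180.0], [8.0, 2.0])
s_t = fuse_curiosity_map(s_prev, proj_t)

# Step t+1 from the same pose: higher scores do not overwrite lower ones.
proj_next = project_scores(s_t.shape, agent, [0.0, 180.0], [9.0, 9.0])
s_next = fuse_curiosity_map(s_t, proj_next)
```

Under the minimum rule assumed here, a region once judged unpromising stays low even if a later prediction is more optimistic. A different rule, such as averaging, would trade that conservatism for faster recovery from a wrong early prediction.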