WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

Dujun Nie; Xianda Guo; Yiqun Duan; Ruijun Zhang; Long Chen

TL;DR

WMNav presents a fully modular world-model framework that embeds Vision-Language Models into predictive planning for zero-shot Object Goal Navigation. It introduces an online Curiosity Value Map to store predicted environmental states and a subtask decomposition to provide denser, reward-like feedback for prompts. A two-stage action proposer steers exploration and precise localization using ReasonVLM and PlanVLM without task-specific fine-tuning, yielding state-of-the-art results on HM3D and MP3D. This approach demonstrates the practical potential of VLM-driven world models to improve planning efficiency and robustness in complex indoor environments.

Abstract

Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent Vision-Language Model (VLM)-based agents have demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by VLMs. It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes an online-maintained Curiosity Value Map as part of the world model memory, which provides dynamic configuration for the navigation policy. By decomposing the task into subtasks according to a human-like thinking process, WMNav alleviates the impact of model hallucination: decisions are based on the feedback difference between the world model's plan and the actual observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D shows that WMNav surpasses existing zero-shot methods in both success rate (SR) and exploration efficiency, measured by SPL (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.
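The abstract reports results as SR and SPL without defining them. As a reading aid, the sketch below computes both using the definitions standard in the embodied navigation literature: SR is the fraction of successful episodes, and SPL (Success weighted by Path Length) weights each success by the ratio of the shortest path length to the longer of the agent's path and the shortest path. The `Episode` class, its field names, and the sample numbers are illustrative assumptions and are not taken from the WMNav evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical record layout)."""
    success: bool                # did the agent stop at the goal object?
    shortest_path_length: float  # geodesic distance from start to goal, in meters
    agent_path_length: float     # distance the agent actually travelled, in meters


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: mean over episodes of S_i * l_i / max(p_i, l_i)."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        denom = max(e.agent_path_length, e.shortest_path_length)
        if denom > 0:
            total += e.shortest_path_length / denom
        else:
            # Assumption: an agent that starts on the goal counts as optimal.
            total += 1.0
    return total / len(episodes)


# Made-up episodes, for illustration only.
episodes = [
    Episode(success=True, shortest_path_length=5.0, agent_path_length=10.0),
    Episode(success=True, shortest_path_length=4.0, agent_path_length=4.0),
    Episode(success=False, shortest_path_length=3.0, agent_path_length=9.0),
    Episode(success=True, shortest_path_length=6.0, agent_path_length=8.0),
]
print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.5625
```

Because SPL can never exceed SR, a gain in SPL indicates that the agent reaches goals along shorter paths in addition to reaching them more often.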
