NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction

Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, Mu Xu

TL;DR

NavForesee tackles long-horizon embodied navigation by unifying Vision-Language Model planning with a dual-horizon world model. It introduces hierarchical language planning, which decomposes instructions and tracks progress against milestones, coupled with short-term and long-term predictive foresight that imagines depth and semantic features. The approach is implemented on a Qwen2.5-VL backbone with dream-query mechanisms and trained via a joint, multi-task objective. Empirical results on R2R-CE and RxR-CE show competitive success rate (SR) and oracle success rate (OSR), with ablations confirming the value of both planning and dual-horizon predictions for robust navigation in unseen environments.

Abstract

Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning in unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmarks that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.

Paper Structure

This paper contains 23 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: VLM-driven hierarchical navigation plan dataset generation. Episodes from R2R-CE and RxR-CE are processed by Gemini 2.5 Pro, which decomposes long instructions into sub-instructions and identifies keyframe milestones. To generate waypoint-level reasoning labels, waypoints are sampled between milestones and annotated with a navigation summary, future plan, and action (forward, left, right, or stop).
  • Figure 2: Overall architecture of NavForesee. The model is built on the Qwen2.5-VL-3B-Instruct backbone, integrating two complementary functionalities: (1) VLM-based hierarchical planning and (2) world model-based dual-horizon visual prediction. For hierarchical planning, textual instruction and visual observations are encoded via Qwen’s original multimodal encoders to produce auto-regressive sub-goal plans. For prediction, a position encoder encodes the agent’s relative pose, and short- and long-horizon dream queries (depth and semantic subqueries) are appended to multimodal embeddings. These queries, processed through structured attention, feed lightweight convolutional decoders for environmental predictions and an MLP head for navigation actions.
  • Figure 3: Short-term depth and semantics predictions. From top to bottom: frames with timestamps, future ground truth frames with timestamps, depth predictions for future frames, semantics predictions for future frames. Semantic features are DINOv2 features, visualized with a pretrained segmentation head. Instructions: UP the stairs. Turn to the left and enter into the second open door on the left. Walk towards the foot of the bed. Turn right and enter the open door to the bathroom.
  • Figure 4: NavForesee’s geometric-semantic feature imagination across different motion modes. The model accurately predicts environmental dynamics in straight motion, generalizes effectively to turning scenarios, and infers detailed object geometry and depth distribution from minimal visual input, such as a brief glimpse into a room.
  • Figure 5: Hierarchical planning examples generated by NavForesee for the instruction "Go up the stairs and straight forward the doorway. Turn right, move forward, and enter the doorway on the right. Move forward into the bedroom and stop in front of the toilet". From top to bottom: frames with timestamps, global navigation map, and navigation planning outputs. NavForesee accurately identifies milestones along the route, summarizes completed sub-instructions, and generates the next sub-instruction in accordance with the instruction context.
  • ...and 1 more figures
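
The waypoint-level labels described in the Figure 1 caption (a navigation summary, a future plan, and one of four actions) can be pictured with a small Python sketch. This is an illustration only, not the authors' data pipeline: the names `WaypointLabel`, `plan_at`, and `build_label`, the placeholder strings, and the assumption that the i-th milestone frame marks completion of the i-th sub-instruction are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Action vocabulary as listed in the Figure 1 caption.
ACTIONS = ("forward", "left", "right", "stop")


@dataclass(frozen=True)
class WaypointLabel:
    """Hypothetical record for one waypoint-level reasoning label."""
    navigation_summary: str  # sub-instructions completed so far
    future_plan: str         # next sub-instruction to pursue
    action: str              # one of ACTIONS

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")


def plan_at(step, sub_instructions, milestone_steps):
    """Split sub-instructions into (completed, next) at a frame index.

    Assumption: milestone_steps is sorted and milestone_steps[i] is the
    frame at which sub_instructions[i] is considered completed.
    """
    done = bisect_right(milestone_steps, step)  # milestones reached so far
    completed = sub_instructions[:done]
    nxt = sub_instructions[done] if done < len(sub_instructions) else None
    return completed, nxt


def build_label(step, action, sub_instructions, milestone_steps):
    """Assemble a label for a waypoint sampled between milestones."""
    completed, nxt = plan_at(step, sub_instructions, milestone_steps)
    # The fallback strings below are placeholders, not taken from the paper.
    summary = " ".join(completed) or "Nothing completed yet."
    plan = nxt or "All sub-instructions completed."
    return WaypointLabel(summary, plan, action)
```

For example, with sub-instructions `["Go up the stairs.", "Turn right.", "Stop in front of the toilet."]` and hypothetical milestone frames `[10, 25, 40]`, calling `build_label(12, "right", ...)` yields a label whose summary is the first sub-instruction and whose future plan is the second. In the paper the summaries and plans are produced by a VLM rather than by string concatenation; the sketch only shows the shape of the record.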