MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming

Shuo Wang, Yongcai Wang, Zhaoxin Fan, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Wanting Li, Xudong Cai, Yeying Jin, Deying Li

TL;DR

MonoDream tackles monocular Vision-Language Navigation by introducing a Unified Navigation Representation that fuses instructions with past monocular visuals, enabling global and future-aware reasoning. It adds Latent Panoramic Dreaming as training-time supervision to align the learned latent space with panoramic RGB-D features for current and near-future steps, without requiring panoramic sensors during inference. Through multi-task co-training that includes action prediction and instruction reasoning, MonoDream achieves state-of-the-art monocular VLN-CE performance on the R2R-CE and RxR-CE benchmarks, with strong data efficiency and cross-dataset generalization. The approach is lightweight and inference-efficient. Ablation studies show that LPD provides the largest gains, with single-step future prediction balancing accuracy and uncertainty.

Abstract

Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.

Paper Structure

This paper contains 28 sections, 8 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Overview of the MonoDream framework. MonoDream employs a Vision-Language Action (VLA) framework to encode both visual observations and textual instructions into a Unified Navigation Representation (UNR). The Action Prediction task generates the next action in natural language and is trained with an action loss. The Latent Panoramic Dreaming (LPD) task encourages the model to internally imagine the latent features of panoramic RGB-D images at the current and future steps, providing global visual and geometric context via a feature-based loss. This multi-task co-training enables monocular agents to reason beyond the limited field of view and make more informed navigation decisions.
  • Figure 2: Qualitative results of MonoDream. We compare MonoDream with the ablated variant w/o LPD. Green arrows indicate correct actions, and red arrows indicate errors. (A) MonoDream correctly identifies the hard turning point at the fourth frame. In contrast, the w/o LPD baseline misreads the hallway layout, proceeds straight, and stops in the wrong room. (B) The w/o LPD model makes a critical mistake at the very first step, while MonoDream, by leveraging internalized global features from LPD, correctly turns left even without explicit corner information in the initial monocular view.
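
The co-training objective described in the Figure 1 caption (an action loss plus a feature-based LPD loss) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the feature-based loss is a mean squared error, collapses the natural-language action loss into a single softmax cross-entropy term, and uses hypothetical names throughout (`co_training_loss`, `lpd_weight`, and the keys `rgb_now`, `depth_now`, `rgb_next`, `depth_next`); the source specifies none of these.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def mse(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    if len(pred) != len(target):
        raise ValueError("feature vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def co_training_loss(action_logits, action_label, dreamed, panoramic, lpd_weight=1.0):
    """Action loss plus a weighted feature loss averaged over dreaming targets.

    dreamed:   latent features predicted from monocular input (assumed layout).
    panoramic: target latent features of panoramic RGB / depth observations,
               keyed by hypothetical names such as "rgb_now" or "depth_next".
    """
    action_loss = cross_entropy(action_logits, action_label)
    feature_loss = sum(mse(dreamed[k], panoramic[k]) for k in panoramic) / len(panoramic)
    return action_loss + lpd_weight * feature_loss


if __name__ == "__main__":
    # Toy values only; real features would come from frozen panoramic encoders.
    dreamed = {"rgb_now": [0.1, 0.2], "depth_next": [0.0, 1.0]}
    panoramic = {"rgb_now": [0.0, 0.2], "depth_next": [0.0, 0.5]}
    print(co_training_loss([2.0, 0.5, -1.0], 0, dreamed, panoramic, lpd_weight=0.5))
```

In this sketch the panoramic targets appear only in the training loss. At inference only the action branch would be evaluated, which matches the statement that no panoramic sensors are required at test time.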