Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos

Mingfei Han, Haihong Hao, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev

TL;DR

This work introduces a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. The framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D.

Abstract

Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D, providing richer spatial and semantic supervision. A key extension in this work is the incorporation of implicit geometry representations, which extract spatial cues directly from RGB frames without requiring fragile 3D reconstruction. This approach substantially improves data utilization, alleviates reconstruction failures, and unlocks large portions of previously unusable video data. Comprehensive experiments across multiple VLN benchmarks (CVDN, SOON, R2R, and REVERIE) demonstrate that our method not only sets new state-of-the-art performance but also enables the development of robust zero-shot navigation agents. By bridging large-scale web videos with implicit spatial reasoning, this work advances embodied navigation towards more scalable, generalizable, and real-world applicable solutions.

Paper Structure (28 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Overview of the original RoomTour3D pipeline with COLMAP reconstruction and explicit geometry information. Starting from a room tour video, we first apply BLIP-2 [li2023blip2] to the frame sequence to predict room locations. Next, we use RAM [zhang2023recognize] and Grounding-DINO [liu2023grounding] to identify objects within the frames and employ Depth-Anything [depthanything] for depth prediction. Subsequently, COLMAP is used to reconstruct the 3D scene with complete geometry information, and we sample human walking trajectories from the continuous frames. Each trajectory captures open-world objects, their positions, and their depths relative to the camera. Finally, we use an advanced LLM, i.e., GPT-4, to generate free-form descriptions for pretraining, yielding description-enriched trajectories. For the trajectory shown in the figure, which involves instant turning points, we treat <0> to <6> as the walking trajectory and <A>, <B>, and <C> as side-watching points, which serve as negative candidates for the navigation finetuning task, yielding action-enriched trajectories. For more details, please refer to the data generation section of the paper.
  • Figure 2: Controllable instruction generation. (a) Using open-source expert models, we identify which objects appear in the frames, estimate how far away each object is, and determine where it is located. This information is then textualized to create richly detailed frame captions. (b) BLIP-2 is adopted to predict and smooth room locations across sequential frames. (c) Combining room locations and object information, we use GPT-4 for controllable, open-vocabulary instruction generation. The prompt consists of a task instruction that defines the generation task and in-context examples that constrain the output style.
  • Figure 3: Model training diagram using RoomTour3D. Two complementary tasks are designed to enhance NaviLLM: (a) Pretraining. Sampled frames along the trajectory serve as candidate observations, and the model is optimized to summarize object progression along the path. (b) Finetuning. Each frame acts as a navigable step. Given the historical observations <0>–<2> and a navigation instruction, the model predicts the next action by choosing among candidate views (A–D).
  • Figure 4: Overview of implicit-geometry training. Our RoomTour3D-IGR processes both explicit geometry from simulators and implicit geometry from RoomTour videos, alongside task instructions. (a) For simulator data, the pipeline follows the original RoomTour3D setup: RGB observations are encoded by the scene encoder, while explicit geometric features (e.g., distance and heading) are incorporated. For instance, <2>–<A>: 1.6 m, 88° denotes a spatial relation guiding the agent’s next action. (b) For RoomTour videos, frames 1–2 form the trajectory history, and frame 3 acts as a navigation candidate. RGB frames are encoded by the scene encoder, while a VGGT-based spatial encoder extracts implicit geometric features. These embeddings are projected via a spatial projector into the LLM’s latent space to guide accurate action prediction.
  • Figure 5: Visual robustness evaluation under common degradations on R2R Val Unseen. The figure illustrates the perturbation types (Gaussian noise, motion blur, JPEG compression, defocus blur, and brightness) with image examples, together with SPL and SR results. Compared to NaviLLM, our RoomTour3D-trained agent suffers smaller performance drops under all degradations, highlighting improved tolerance to real-world conditions.
  • ...and 1 more figure
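
The Figure 4 caption mentions explicit geometric features of the form "<2>–<A>: 1.6 m, 88°", i.e., a distance and heading from the agent's current step to a candidate view. The following is a minimal sketch of how such a relation could be computed from 2D poses and turned into text. It is not the authors' implementation: the function names, the 2D pose format, and the heading convention (counter-clockwise, relative to the agent's facing direction, wrapped to [-180, 180)) are all assumptions made for illustration.

```python
import math


def relative_distance_heading(agent_xy, agent_yaw_deg, target_xy):
    """Return (distance in metres, heading in degrees) from agent to target.

    Assumed convention (not taken from the paper): yaw and heading are
    measured counter-clockwise, heading is relative to the agent's facing
    direction, and the result is wrapped to the range [-180, 180).
    """
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)
    # Absolute bearing of the target in the world frame.
    bearing = math.degrees(math.atan2(dy, dx))
    # Subtract the agent's yaw and wrap into [-180, 180).
    heading = (bearing - agent_yaw_deg + 180.0) % 360.0 - 180.0
    return distance, heading


def textualize(src, dst, distance, heading):
    """Format a relation in the style of the Figure 4 caption (hypothetical helper)."""
    return f"<{src}>–<{dst}>: {distance:.1f} m, {heading:.0f}°"


if __name__ == "__main__":
    # Hypothetical poses: the agent at step 2 faces along +x; candidate A is
    # placed 1.6 m away at an 88 degree bearing.
    angle = math.radians(88.0)
    candidate = (1.6 * math.cos(angle), 1.6 * math.sin(angle))
    d, h = relative_distance_heading((0.0, 0.0), 0.0, candidate)
    print(textualize(2, "A", d, h))
```

In a simulator the poses are available directly, which is what makes this explicit form cheap to compute. For web videos, the caption describes a VGGT-based spatial encoder that supplies implicit geometric features in place of such explicit values.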