RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev

TL;DR

RoomTour3D tackles data scarcity in vision-and-language navigation (VLN) by mining real-world room tour videos to build geometry-aware training data. It reconstructs 3D room layouts with COLMAP, extracts object and depth information, and generates open-vocabulary instructions with GPT-4, producing description-enriched and action-enriched trajectories. These data enable a NaviLLM-based embodied agent to achieve state-of-the-art results on multiple VLN benchmarks and to perform zero-shot navigation, demonstrating strong generalization to open-world environments. The dataset and prompts are released to support broad use and further research in embodied AI.

Abstract

Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.

Paper Structure

This paper contains 27 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of our RoomTour3D data generation. Starting from a room tour video, we first apply BLIP-2 [li2023blip2] to the frame sequence to predict the room locations. Next, we use RAM [zhang2023recognize] and Grounding-DINO [liu2023grounding] to identify objects within the frames and employ Depth-Anything [depthanything] for depth prediction. Subsequently, COLMAP is used to reconstruct the 3D scene with complete geometry information, and we sample human walking trajectories from the continuous frames. Each trajectory captures open-world objects, their positions, and their depths relative to the camera. Finally, we use an advanced LLM, GPT-4, to generate free-form descriptions for pretraining, yielding the description-enriched trajectories. For the trajectory shown in the figure, which involves instant turning points, we treat <0> to <6> as the walking trajectory and <A>, <B>, and <C> as side-watching points, which serve as negative candidates for the navigation finetuning task; these form the action-enriched trajectories. For more details, please refer to the data generation section.
  • Figure 2: Instruction generation in a controllable way. (a) Using open-source expert models, we identify what objects are in the frames, assess how far away each object is, and determine where it is located. This information is then textualized to create richly detailed frame captions. (b) BLIP-2 is adopted to predict and smooth room locations across sequential frames. (c) Combining room locations and object information, we use GPT-4 for controllable and open-vocabulary instruction generation. The prompt consists of a task instruction that defines the generation task and in-context examples that constrain the output style.
  • Figure 3: Model training diagram with RoomTour3D. We design two tasks for our RoomTour3D to boost NaviLLM. (a) Pretraining: Sampled frames on the trajectory are treated as candidate observations. The model is optimized to summarize object progression along the path. (b) Finetuning: Each frame is considered a navigable step. Given historical observations <0> to <2> and a navigation instruction, the model is prompted to predict the next action by selecting from the candidate observations View A, View B, View C, and View 4.
  • Figure 4: Paths of NaviLLM [navillm] and ours on R2R-unseen. Purple and green circles denote the start and target locations, respectively, and the red circle represents an incorrect endpoint. According to the instruction, the agent should turn left at the waypoint marked in yellow. Our method makes the correct decision, while the baseline is confused by a similar entrance at the waypoint and mistakenly turns right.
  • Figure 5: Visualization of significant view change point selection. For each cluster, we identify the walking tracks and find the candidate views for the next action selection. This process ensures a diversified set of views in the setting without panorama images.
  • ...and 6 more figures
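
The textualization and room-smoothing steps described in Figure 2 (a) and (b) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Detection` schema, the majority-vote smoothing window, the left/center/right and near/far thresholds, and the convention that a smaller depth value means a closer object are all assumptions made for this example. In the actual pipeline, the inputs would come from models such as BLIP-2, RAM, Grounding-DINO, and Depth-Anything, and the resulting captions would be passed to GPT-4.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object in a frame (hypothetical schema, not from the paper)."""
    label: str       # open-vocabulary object tag
    x_center: float  # horizontal box center, normalized to [0, 1]
    depth: float     # relative depth in [0, 1]; assumed: smaller means closer


def smooth_room_labels(labels, window=5):
    """Replace each per-frame room label with the majority label
    inside a centered sliding window (assumed smoothing scheme)."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighborhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


def describe_position(x_center):
    """Map a normalized horizontal position to a coarse phrase."""
    if x_center < 1 / 3:
        return "on the left"
    if x_center > 2 / 3:
        return "on the right"
    return "in the center"


def describe_distance(depth, near=0.33, far=0.66):
    """Map a relative depth to a coarse phrase (thresholds are illustrative)."""
    if depth < near:
        return "nearby"
    if depth > far:
        return "far away"
    return "at mid-distance"


def textualize_frame(room, detections):
    """Turn a room label and object detections into a one-line frame caption,
    listing the closest objects first."""
    parts = [
        f"a {d.label} {describe_distance(d.depth)} {describe_position(d.x_center)}"
        for d in sorted(detections, key=lambda d: d.depth)
    ]
    if not parts:
        return f"In the {room}."
    return f"In the {room}: " + "; ".join(parts) + "."


if __name__ == "__main__":
    rooms = smooth_room_labels(
        ["kitchen", "kitchen", "hallway", "kitchen", "kitchen",
         "bedroom", "bedroom", "bedroom"],
        window=3,
    )
    frame = [Detection("fridge", 0.8, 0.7), Detection("table", 0.5, 0.2)]
    print(rooms)
    print(textualize_frame(rooms[0], frame))
```

Captions of this kind, one per sampled frame, are the sort of structured text that could be concatenated into a prompt together with a task instruction and in-context examples, as Figure 2 (c) describes.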