From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos

Matthew Wallingford, Anand Bhattad, Aditya Kusupati, Vivek Ramanujan, Matt Deitke, Sham Kakade, Aniruddha Kembhavi, Roozbeh Mottaghi, Wei-Chiu Ma, Ali Farhadi

TL;DR

This work tackles the scarcity of large-scale real-world multi-view data for 3D understanding by harvesting 360° video at scale (360-1M) and training a diffusion-based, viewpoint-conditioned model (Odin) that can synthesize novel views with unrestricted camera movement. Odin uses long-range correspondences from 360° video, motion masking to handle dynamic content, and a trajectory-based sampling regime to enable 3D reconstruction from single images. On standard novel view synthesis benchmarks (DTU, Mip-NeRF 360) and a held-out 360-1M set, Odin improves perceptual metrics and demonstrates robust 3D reconstruction performance, surpassing several baselines without task-specific fine-tuning. The work provides a scalable data-and-model pipeline for real-world 3D scene understanding, and the authors plan to release the dataset, code, and models to spur further research and applications in AR/VR and robotics.

Abstract

Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large-scale synthetic and object-centric 3D datasets have been shown to be effective in training models that have 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source of real-world 3D data, but finding diverse yet corresponding views of the same content has proven difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture, which restricts the ability to access scenes from more diverse and potentially useful perspectives. We argue that large-scale 360 videos address these limitations by providing scalable, corresponding frames from diverse views. In this paper, we introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M. Empowered by the largest real-world, multi-view dataset to date, Odin is able to freely generate novel views of real-world scenes. Unlike previous methods, Odin can move the camera through the environment, enabling the model to infer the geometry and layout of the scene. Additionally, we show improved performance on standard novel view synthesis and 3D reconstruction benchmarks.
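To make the "controllable viewpoint" property of 360 video concrete, the sketch below maps a pixel of a virtual pinhole camera, pointed in a chosen direction, to the matching location in an equirectangular panorama frame. It is an illustration only and not the authors' pipeline: it assumes frames are stored in equirectangular format, and the function name `perspective_to_equirect`, its parameters, and the axis conventions (x right, y down, z forward) are hypothetical choices made for this example.

```python
import math

def perspective_to_equirect(px, py, out_w, out_h, fov_deg,
                            yaw_deg, pitch_deg, pano_w, pano_h):
    """Map a pixel of a virtual pinhole view to equirectangular coordinates.

    Hypothetical helper for illustration. (px, py) are continuous pixel
    coordinates in the virtual view, whose principal point is at
    (out_w / 2, out_h / 2). Positive yaw turns the camera right,
    positive pitch tilts it up.
    """
    # Focal length in pixels from the horizontal field of view.
    f = (out_w / 2) / math.tan(math.radians(fov_deg) / 2)

    # Ray through the pixel in camera coordinates (x right, y down, z forward).
    x, y, z = px - out_w / 2, py - out_h / 2, f

    # Rotate about the x axis (pitch), then about the y axis (yaw).
    p, yw = math.radians(pitch_deg), math.radians(yaw_deg)
    y, z = y * math.cos(p) - z * math.sin(p), y * math.sin(p) + z * math.cos(p)
    x, z = x * math.cos(yw) + z * math.sin(yw), -x * math.sin(yw) + z * math.cos(yw)

    # Convert the ray direction to longitude and latitude on the sphere.
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)                          # in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, -y / norm)))  # up is -y

    # Longitude 0 / latitude 0 sits at the center of the panorama.
    u = ((lon / (2 * math.pi) + 0.5) % 1.0) * pano_w
    v = (0.5 - lat / math.pi) * pano_h
    return u, v
```

Sampling the panorama at the returned `(u, v)` for every output pixel yields a perspective crop in the chosen direction. Because yaw and pitch can be picked freely after capture, two frames taken at different times can be pointed at the same content, which a fixed-viewpoint video does not allow.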

Paper Structure

This paper contains 29 sections, 4 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: By learning from the largest real-world, multi-view dataset to date, our model, Odin, can synthesize novel views of rich scenes from a single input image with free camera movement throughout the scene. We can then reconstruct the 3D scene geometry from these geometrically consistent generations.
  • Figure 2: Left: An illustrative trajectory of a standard video with the viewpoint fixed at the time of capture. The fixed viewpoint makes finding corresponding frames challenging. Right: The trajectory of a 360$^\circ$ video through the scene. The controllable camera enables alignment of views at different frames of the video.
  • Figure 3: Qualitative comparison of novel view synthesis on real-world scenes. The left and right images are conditioned on camera views from the left and right respectively. In the middle scene of the kitchen, Odin accurately models the geometry of the table counter and chairs as well as unseen parts of the scene such as the living room.
  • Figure 4: Examples of generated 3D scenes using Odin. The blue dot indicates the location of the input image and the red lines indicate the trajectory of the camera which generated the images. Odin is capable of long-range generation of geometrically consistent images. In the bottom scene, we see the model accurately infers the geometry of the unseen cathedral ceiling and the long hallway.
  • Figure 5: Video duration distribution in 360-1M.
  • ...and 5 more figures