Generating 3D-Consistent Videos from Unposed Internet Photos

Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely

TL;DR

This work designs a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.

Abstract

We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.

Paper Structure

This paper contains 12 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Given $n$ unposed input keyframes, the goal is to generate a video of the scene with a realistic camera trajectory and consistent geometry. From top to bottom: Ours, Luma Dream Machine (a commercial video generation model), FILM (a frame interpolation method). Luma hallucinates new buildings (left scene) and statues (right scene) without understanding the scene layout. FILM is unable to handle wide-baseline inputs and produces blurry transitions. See https://genechou.com/kfcw for video playback.
  • Figure 2: Training objectives. Left: Multiview inpainting. We provide $n$ condition images and one target image to a diffusion model. We add noise to 80% of the target following the diffusion process. The condition images and the remaining 20% of the target are kept clean. Note how some regions in the target are not seen in the conditions. The model learns priors such as symmetry to generate a plausible image. Right: View interpolation. We take $k$ images from a video sequence and add noise to frames 2 through $k-1$ following the diffusion process. The model generates a sequence following a plausible camera path connecting the first and last frames.
  • Figure 3: Multiview inpainting of internet photos and view interpolation of videos can be unified under the same denoising objective. Left: Training. We denoise the noisy patches (masked patches in multiview inpainting and intermediate frames in view interpolation), while extracting visual information from clean patches (blue patches) via self-attention. Then, we calculate a loss between the denoised (orange) and ground-truth patches. This process operates in latent space. Right: Inference. Given unposed images of the same scene, we initialize and denoise a fixed number of frames via DDIM.
  • Figure 4: Top two rows: We control illumination by conditioning on the CLIP embedding of the red-bordered image during inference. Bottom: Without this condition, illumination varies across frames.
  • Figure 5: Example scene from our user study interface. We provided detailed descriptions for three criteria: Consistency, CameraPath, and Aesthetics. For each scene, users are asked to express a preference between our results and those of a random baseline.
  • ...and 3 more figures
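
The two training objectives in the Figure 2 and Figure 3 captions differ only in which patches are noised, so a small sketch of the masking may help. This is a NumPy illustration, not the authors' implementation. All function names are hypothetical. Choosing the noised target patches uniformly at random, and using the standard DDPM forward formula with a single `alpha_bar` value, are both assumptions; the captions only say that noise is added "following the diffusion process".

```python
import numpy as np

def inpainting_noise_mask(n_cond, patches_per_image, noise_frac=0.8, rng=None):
    """Mask of shape (n_cond + 1, patches_per_image); True marks a noised patch.

    The target image is stored last. Condition images stay fully clean.
    Random patch selection is an assumption made for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((n_cond + 1, patches_per_image), dtype=bool)
    n_noisy = int(round(noise_frac * patches_per_image))
    noisy_idx = rng.choice(patches_per_image, size=n_noisy, replace=False)
    mask[-1, noisy_idx] = True
    return mask

def interpolation_noise_mask(k, patches_per_frame):
    """Mask of shape (k, patches_per_frame): frames 2..k-1 noised, ends clean.

    Assumes k >= 3 so that at least one intermediate frame exists.
    """
    mask = np.zeros((k, patches_per_frame), dtype=bool)
    mask[1:-1] = True
    return mask

def apply_noise(latents, mask, alpha_bar, rng=None):
    """Noise only the masked patches; latents has shape (images, patches, dim).

    Uses the standard DDPM forward step (an assumption here):
    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(latents.shape)
    noised = np.sqrt(alpha_bar) * latents + np.sqrt(1.0 - alpha_bar) * eps
    return np.where(mask[..., None], noised, latents)

def masked_denoising_loss(pred, target, mask):
    """Mean squared error computed over the noised patches only."""
    return ((pred - target) ** 2)[mask].mean()
```

Because both masks have the same layout (images by patches), the same `apply_noise` and `masked_denoising_loss` calls serve internet-photo batches and video batches alike, which is the unification the Figure 3 caption describes.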