MegaScenes: Scene-Level View Synthesis at Scale

Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, Noah Snavely

TL;DR

The paper addresses the scarcity of diverse scene-level training data for 3D-aware novel view synthesis (NVS) and shows that finetuning diffusion-based NVS models on MegaScenes substantially improves generalization to in-the-wild scenes. It introduces MegaScenes, a large-scale dataset of ~430K scenes with ~9M images and ~100K SfM reconstructions derived from Wikimedia Commons, from which over 2M training image pairs with known relative poses are built. The authors augment prior pose-conditioned diffusion models with warp conditioning, which warps the input view into the target view, and with extrinsic matrices that enforce correct scale, yielding more pose-consistent and realistic outputs. In in-domain evaluation on MegaScenes and cross-domain tests on DTU, Mip-NeRF360, and RealEstate10K, the approach achieves better pose alignment and visual fidelity than the baselines, supporting the effectiveness of both the dataset and the method for scene-level NVS.
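To make "known relative poses" concrete, the following is a minimal sketch of how a relative pose between a reference and a target view can be derived from two extrinsic matrices. It assumes 4x4 world-to-camera extrinsics (the convention COLMAP uses); the helper names `make_extrinsic`, `rot_y`, and `relative_pose` are hypothetical and are not taken from the paper's code. SfM translations are only defined up to a global scale, so any scale normalization the authors apply is outside this sketch.

```python
import numpy as np

def make_extrinsic(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 world-to-camera matrix."""
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = t
    return E

def rot_y(deg):
    """Rotation about the y axis by `deg` degrees."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def relative_pose(E_ref, E_tgt):
    """Transform taking reference-camera coordinates to target-camera coordinates.

    Assumes both inputs are world-to-camera extrinsics, so that
    x_tgt = E_tgt @ X_world and x_ref = E_ref @ X_world.
    """
    return E_tgt @ np.linalg.inv(E_ref)

# Illustrative cameras (made-up values, not from the dataset).
E_ref = make_extrinsic(rot_y(10.0), np.array([0.0, 0.0, 1.0]))
E_tgt = make_extrinsic(rot_y(40.0), np.array([0.5, 0.0, 1.2]))
T_rel = relative_pose(E_ref, E_tgt)

# A world point expressed in the reference camera, then moved to the target camera.
X_world = np.array([1.0, 2.0, 5.0, 1.0])
x_tgt_via_rel = T_rel @ (E_ref @ X_world)
```

Here `x_tgt_via_rel` agrees with `E_tgt @ X_world`, which is the property a pose-conditioned model relies on when it is given only the relative transform.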

Abstract

Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse), or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues to further create a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments, we validate the effectiveness of both our dataset and method on generating in-the-wild scenes. For details on the dataset and code, see our project page at https://megascenes.github.io.

Paper Structure (27 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: The MegaScenes Dataset is an extensive collection of around 430k scenes, featuring over 100k structure-from-motion reconstructions and over 2 million registered images. MegaScenes includes a diverse array of scenes, such as minarets (e.g., Qutb Minar), building interiors (e.g., wooden church in Călinești Căeni), statues (e.g., Puits de Moïse), bridges (e.g., Kapellbrücke), towers (e.g., Puerta de Europa), religious buildings (e.g., Karmravor), and natural landscapes (e.g., Teide volcano). The images of these scenes are captured under varying conditions, including different times of day, various weather and illumination, and from different devices with distinct camera intrinsics.
  • Figure 2: MegaScenes curation pipeline. We first source and identify potential scene categories from Wikidata. Subsequently, images and metadata for each scene category are downloaded. Finally, we reconstruct scenes using Structure from Motion (SfM) and clean them using the Doppelgangers pipeline (Cai et al., 2023).
  • Figure 3: Distribution of the MegaScenes Dataset. On the left, we depict the frequency of scenes grouped by Wikidata class. This includes only select classes with more than 3,500 scenes; note that a single scene may be an instance of multiple classes. On the right, we visualize the geospatial distribution of collected scenes worldwide.
  • Figure 4: We create over 2 million pairs of training images for novel view synthesis. Each pair contains a relative pose and a warping from the reference to the target image, which we use for both training and evaluation. We align estimated monocular depths with sparse point clouds from COLMAP (Schönberger and Frahm, 2016), and unproject the RGBD images to a mesh for viewpoint rendering. See the paper's NVS setup section (label sec:nvssetup) for details and the figure labeled fig:megascenesfig for more examples.
  • Figure 5: We evaluate multiple baselines on MegaScenes, which contains diverse scenes, poses, and object compositions. Prior methods exhibit many failure modes in this challenging setting. Our method identifies and addresses these failure modes.
  • ...and 4 more figures
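
The warping step described in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: it fits a scale and shift in depth space by least squares (the paper may align depths differently), it assumes a pinhole camera with the same intrinsics `K` for both views, and it reprojects per-pixel points instead of rendering a mesh. The function names `align_depth` and `warp_pixels` and all numeric values are hypothetical.

```python
import numpy as np

def align_depth(mono_depth, sparse_depth, valid):
    """Fit s, b so that s * mono_depth + b matches sparse_depth at valid pixels."""
    d = mono_depth[valid]
    z = sparse_depth[valid]
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, z, rcond=None)
    return s, b

def warp_pixels(depth, K, T_rel):
    """Unproject each pixel with its depth, move it to the target camera, reproject.

    Returns target pixel coordinates of shape (H, W, 2) and target depths (H, W).
    Assumes strictly positive depths and a shared pinhole intrinsic matrix K.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays with z = 1
    pts_ref = rays * depth.reshape(-1, 1)      # 3D points in the reference camera
    pts_tgt = pts_ref @ T_rel[:3, :3].T + T_rel[:3, 3]
    proj = pts_tgt @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    return uv.reshape(H, W, 2), pts_tgt[:, 2].reshape(H, W)

# Synthetic example: monocular depth that is off by a known scale and shift.
rng = np.random.default_rng(0)
H, W = 32, 32
mono = rng.uniform(1.0, 4.0, size=(H, W))
sparse = 2.5 * mono + 0.3                      # stand-in for sparse SfM depths
valid = rng.random((H, W)) < 0.05              # only a few pixels carry SfM points
s, b = align_depth(mono, sparse, valid)
aligned = s * mono + b

K = np.array([[50.0, 0.0, W / 2.0], [0.0, 50.0, H / 2.0], [0.0, 0.0, 1.0]])
uv_same, _ = warp_pixels(aligned, K, np.eye(4))   # identity pose: pixels stay put
T_shift = np.eye(4)
T_shift[0, 3] = 0.1                               # points shift along +x in the target frame
uv_shift, z_shift = warp_pixels(aligned, K, T_shift)
```

With the identity pose every pixel maps back to itself, and with the small x translation the warped coordinates move by an amount inversely proportional to depth, which is the parallax cue the warped image passes to the model. A full implementation would also need to handle occlusions and holes, which mesh rendering addresses and this point-based sketch does not.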