MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data

Hanwen Jiang, Zexiang Xu, Desai Xie, Ziwen Chen, Haian Jin, Fujun Luan, Zhixin Shu, Kai Zhang, Sai Bi, Xin Sun, Jiuxiang Gu, Qixing Huang, Georgios Pavlakos, Hao Tan

TL;DR

MegaSynth addresses the data bottleneck in wide-coverage 3D scene reconstruction by introducing a non-semantic, procedurally generated dataset of 700K scenes that scales up the training data for Large Reconstruction Models (LRMs). The method uses controllable geometry, textures, and lighting, paired with camera-pooled dense-view rendering, and trains on a mix of MegaSynth and real data with photometric and geometry losses to learn robust 3D priors. Empirical results show 1.2–1.8 dB PSNR improvements across indoor/outdoor and in-domain/out-of-domain tests, significant depth-rendering gains, and inference that is orders of magnitude faster than optimization-based baselines; models trained on MegaSynth alone sometimes match models trained on real data. The findings indicate that semantics are not essential for multi-view 3D reconstruction, which enables scalable data generation that improves generalization and can transfer to other 3D tasks.

Abstract

We propose scaling up 3D scene reconstruction by training with synthesized data. At the core of our work is MegaSynth, a procedurally generated 3D dataset comprising 700K scenes, over 50 times larger than the prior real dataset DL3DV, which dramatically scales up the training data. To enable scalable data generation, our key idea is to eliminate semantic information, removing the need to model complex semantic priors such as object affordances and scene composition. Instead, we model scenes with basic spatial structures and geometry primitives, which makes generation scalable. In addition, we control data complexity to facilitate training while loosely aligning it with the real-world data distribution to benefit real-world generalization. We explore training LRMs with both MegaSynth and available real data. Experimental results show that joint training or pre-training with MegaSynth improves reconstruction quality by 1.2 to 1.8 dB PSNR across diverse image domains. Moreover, models trained solely on MegaSynth perform comparably to those trained on real data, underscoring the low-level nature of 3D reconstruction. Additionally, we provide an in-depth analysis of MegaSynth's properties for enhancing model capability, training stability, and generalization, as well as its application to other tasks.
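To give the reported 1.2 to 1.8 dB PSNR gains a more concrete reading, the short sketch below converts a PSNR gain into the corresponding reduction in mean squared error, using the standard definition PSNR = 10 log10(MAX^2 / MSE). The function names and the assumption that pixel values lie in [0, 1] are illustrative choices, not details taken from the paper.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error.

    max_val is the peak pixel value (assumed 1.0 here, i.e. images in [0, 1]).
    """
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    for gain in (1.2, 1.8):
        ratio = mse_ratio_for_gain(gain)
        print(f"+{gain} dB PSNR -> MSE x {ratio:.3f} ({(1 - ratio) * 100:.0f}% lower)")
    # +1.2 dB PSNR -> MSE x 0.759 (24% lower)
    # +1.8 dB PSNR -> MSE x 0.661 (34% lower)
```

Under this definition, a 1.2 to 1.8 dB gain corresponds to roughly a 24% to 34% reduction in per-pixel mean squared error, independent of the baseline PSNR.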

Paper Structure

This paper contains 24 sections, 3 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: We introduce MegaSynth, a non-semantic synthesized dataset for training LRMs. MegaSynth benefits from its scalability and controllability, enabling us to generate 700K scenes in 3 days. We train LRMs with both the large-scale MegaSynth data and small-scale real data, improving LRMs for reconstructing wide-coverage scenes from dense-view images.
  • Figure 2: MegaSynth generation pipeline. We first generate the scene floor plan, where each 3D box represents a shape and different colors represent different object types. We compose shape primitives into objects with geometry augmentations, where these objects further compose the scene. We randomize the texture and lighting, and generate random cameras for rendering.
  • Figure 3: Reconstruction visualization on the in-domain DL3DV data. The results are from Long-LRM at resolution 256. We present both indoor and outdoor results in the first and second rows, respectively. With our MegaSynth (denoted as 'w. MegaSynth'), the model performs better on thin structures (e.g., bottom left), complicated lighting (e.g., top middle), and cluttered scenes (e.g., top right).
  • Figure 4: Reconstruction visualization on the out-of-domain data. The results are from Long-LRM at resolution 256. Results for Hypersim and MipNeRF360 are presented in the first and second rows, respectively.
  • Figure 5: Visualization of input views (first row of each example), rendered target views, and ground-truth target views (last two rows of each example). We include results on the DL3DV benchmark data.
  • ...and 1 more figure