SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections
Zhaoxi Chen, Guangcong Wang, Ziwei Liu
TL;DR
SceneDreamer tackles unbounded 3D scene generation from in-the-wild 2D image collections without 3D annotations. It introduces a bird's-eye-view (BEV) representation consisting of a height field $M_h$ and a semantic field $M_s$, a semantic-aware generative neural hash grid, and a style-conditioned neural volumetric renderer that produces 3D-consistent, photorealistic renderings. The model is trained adversarially and does not require camera poses. Sliding-window inference makes large-scale 3D landscape generation efficient and scalable, and supports high-resolution rendering and perpetual view generation. Experiments show improvements over state-of-the-art 3D scene generators in both depth accuracy and multi-view coherence, with applications in large-scale landscape synthesis and scene interpolation.
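To make the sliding-window idea concrete, here is a minimal sketch of how fixed-size local windows could be cropped from a larger BEV map so that each crop is processed independently. The function name `sliding_windows`, the window size, and the stride are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def sliding_windows(bev, window, stride):
    """Yield (row, col, crop) for square windows over a 2D BEV map.

    Hypothetical helper: the summary only states that a sliding window
    is used at inference, not how it is parameterized.
    """
    h, w = bev.shape[:2]
    for r in range(0, h - window + 1, stride):
        for c in range(0, w - window + 1, stride):
            yield r, c, bev[r:r + window, c:c + window]

# Toy 8x8 height map; a real map would be much larger.
height_map = np.zeros((8, 8), dtype=np.float32)
crops = list(sliding_windows(height_map, window=4, stride=2))
```

Because each crop has a fixed size, the per-window cost stays constant however large the full map grows, which is one plausible reading of why this style of inference scales to unbounded scenes.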
Abstract
In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage the knowledge from 2D images. Our approach begins with an efficient bird's-eye-view (BEV) representation generated from simplex noise, which includes a height field for surface elevation and a semantic field for detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Moreover, we propose a novel generative neural hash grid to parameterize the latent space based on 3D positions and scene semantics, aiming to encode generalizable features across various scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and superiority over state-of-the-art methods in generating vivid yet diverse unbounded 3D worlds.
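The BEV representation described above can be illustrated with a small NumPy sketch. It builds a height field from smooth noise, derives a semantic field from it, and compares the storage cost with a dense voxel grid. Several parts are assumptions made for illustration: smoothed value noise is swapped in for simplex noise (NumPy has no simplex noise routine), the semantic labels come from simple height thresholds, and the names `value_noise`, `thresholds`, and `n` are hypothetical.

```python
import numpy as np

def value_noise(size, cells, rng):
    """Smooth 2D value noise in [0, 1].

    Stand-in for simplex noise (an assumption): bilinear interpolation
    of a random lattice with smoothstep-eased weights.
    """
    lattice = rng.random((cells + 1, cells + 1))
    coords = np.linspace(0.0, cells, size, endpoint=False)
    i0 = np.floor(coords).astype(int)   # lower lattice index per pixel
    i1 = i0 + 1                         # upper lattice index per pixel
    t = coords - i0
    t = t * t * (3.0 - 2.0 * t)         # smoothstep easing
    # Interpolate along columns for the lower and upper lattice rows.
    top = lattice[i0][:, i0] * (1 - t)[None, :] + lattice[i0][:, i1] * t[None, :]
    bot = lattice[i1][:, i0] * (1 - t)[None, :] + lattice[i1][:, i1] * t[None, :]
    # Interpolate along rows.
    return top * (1 - t)[:, None] + bot * t[:, None]

rng = np.random.default_rng(0)
n = 64

# Height field: one elevation value per BEV cell.
height = value_noise(n, cells=4, rng=rng)

# Semantic field: one integer label per BEV cell. Thresholding the
# height is a hypothetical rule used only to produce labels 0, 1, 2.
thresholds = np.array([0.3, 0.6])
semantics = np.digitize(height, thresholds)

# Two n x n maps versus a dense n x n x n voxel grid.
bev_cells = 2 * n * n      # 8192 for n = 64
voxel_cells = n ** 3       # 262144 for n = 64
```

The last two lines show the quadratic-versus-cubic gap the abstract refers to: the two BEV maps grow with the square of the scene resolution, whereas a dense voxel grid grows with its cube.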
