SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections

Zhaoxi Chen; Guangcong Wang; Ziwei Liu

TL;DR

SceneDreamer tackles unbounded 3D scene generation from in-the-wild 2D image collections without 3D annotations. It introduces a BEV scene representation consisting of a height field $M_h$ and a semantic field $M_s$, a semantic-aware generative neural hash grid, and a style-conditioned neural volumetric renderer that produces 3D-consistent, photorealistic renderings. The model is trained adversarially and does not require camera poses. Sliding-window inference makes large-scale landscape generation efficient and supports high-resolution rendering and perpetual view generation. Experiments show that it outperforms state-of-the-art 3D scene generators in both depth accuracy and multi-view coherence, with applications in large-scale landscape synthesis and scene interpolation.

Abstract

In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage the knowledge from 2D images. Our approach begins with an efficient bird's-eye-view (BEV) representation generated from simplex noise, which includes a height field for surface elevation and a semantic field for detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Moreover, we propose a novel generative neural hash grid to parameterize the latent space based on 3D positions and scene semantics, aiming to encode generalizable features across various scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and superiority over state-of-the-art methods in generating vivid yet diverse unbounded 3D worlds.
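To make the BEV representation described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes bilinearly interpolated lattice (value) noise as a stand-in for simplex noise, and it derives semantic labels from height thresholds; the function names, the thresholds, and the three labels are hypothetical. It also illustrates the complexity argument: an $N \times N$ BEV representation stores $2N^2$ values (one height map and one semantic map), whereas a dense voxel grid of the same extent would store $N^3$.

```python
import random


def value_noise_height_field(size, cells=4, seed=0):
    """Return a size x size height field with values in [0, 1].

    Assumption: bilinearly interpolated lattice noise is used here
    as a simple stand-in for the simplex noise named in the abstract.
    """
    rng = random.Random(seed)
    # Random heights on a coarse (cells + 1) x (cells + 1) lattice.
    lattice = [[rng.random() for _ in range(cells + 1)]
               for _ in range(cells + 1)]
    field = []
    for i in range(size):
        y = i * cells / size          # continuous lattice row coordinate
        y0 = int(y)
        ty = y - y0
        row = []
        for j in range(size):
            x = j * cells / size      # continuous lattice column coordinate
            x0 = int(x)
            tx = x - x0
            top = lattice[y0][x0] * (1 - tx) + lattice[y0][x0 + 1] * tx
            bottom = lattice[y0 + 1][x0] * (1 - tx) + lattice[y0 + 1][x0 + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        field.append(row)
    return field


def semantic_field(height, water_level=0.35, snow_level=0.75):
    """Map each height to a hypothetical label: 0 = water, 1 = land, 2 = snow."""
    return [[0 if h < water_level else 2 if h > snow_level else 1
             for h in row]
            for row in height]


# Build a small BEV representation (M_h, M_s) from a random seed.
M_h = value_noise_height_field(64, seed=0)
M_s = semantic_field(M_h)

# Storage comparison: two 2D maps versus a dense 3D voxel grid.
N = len(M_h)
bev_cells = 2 * N * N
voxel_cells = N ** 3
```

Because the height and semantic maps are separate arrays, geometry and semantics can be edited independently in this sketch, which mirrors the disentanglement property the abstract attributes to the BEV representation.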

Paper Structure (51 sections, 6 equations, 17 figures, 6 tables)


Introduction
Related Work
Neural scene representation.
3D-aware GANs.
Scene-level image synthesis.
SceneDreamer
BEV Scene Representation
Height Field.
Semantic Field.
Generative Neural Hash Grid
Unbounded 3D Scene GAN in the Wild
Generator.
Discriminator.
Implementation Details
BEV Scene Representation Generation
...and 36 more sections

Figures (17)

Figure 1: SceneDreamer learns to generate unbounded 3D scenes from in-the-wild 2D image collections. Our method can synthesize diverse landscapes across different styles, with 3D consistency, well-defined depth, and free camera trajectory.
Figure 2: Overview of SceneDreamer. Given a simplex noise $z \sim p_{\mathrm{scene}}$ and a style code $z_\mathrm{style} \sim p_{\mathrm{style}}$ as input, our model is capable of synthesizing large-scale 3D scenes where the camera can move freely and get realistic renderings. We first derive our BEV scene representation which consists of a height field and a semantic field. Then, we use a generative neural hash grid to parameterize the hyperspace of space-varied and scene-varied latent features given scene semantics $\bm{f}_{s}$ and 3D position $\bm{x}$. Finally, a style-modulated renderer is employed to blend latent features $\bm{f}_{\bm{x}}$ and render 2D images via volume rendering.
Figure 3: Diverse samples of SceneDreamer. Our model can synthesize a large variety of 3D scenes with diverse styles, from winter to summer and dawn to dusk. Please check the supplementary and project page for 3D consistent videos.
Figure 4: Sliding window mechanism to generate unbounded scenes beyond training resolution. Given a scene with its BEV scene representation size of $10240\times10240$, we first generate the BEV maps for the entire world, then bind the local scene window (highlighted rectangles) to the camera position (orange). Given a fly-through camera trajectory, the local scene window slides accordingly to render coherent frames (bottom).
Figure 5: Procedural Generation of BEV Scene Representation. The feed-forward mapping from random noise $z$ to the BEV scene representation (Sec. \ref{sec:orthogonal}), i.e., $z \rightarrow (M_h, M_s)$, can be either learned from data or parameter-free. Our instantiation starts from 2D simplex noises (highlighted in orange background). Please refer to Sec. \ref{sec:pcg} for details.
...and 12 more figures
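
The sliding-window mechanism in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the window size of 2048, the function names `local_window` and `crop`, and the choice to clamp the window at the map border are all hypothetical. Only the $10240\times10240$ world size comes from the caption.

```python
def local_window(bev_size, window_size, cam_xy):
    """Return (row0, row1, col0, col1) of a square window centered on the camera.

    Assumption: the window is clamped so it never leaves the BEV map.
    """
    half = window_size // 2
    cam_x, cam_y = cam_xy
    max_start = bev_size - window_size
    col0 = min(max(cam_x - half, 0), max_start)
    row0 = min(max(cam_y - half, 0), max_start)
    return row0, row0 + window_size, col0, col0 + window_size


def crop(bev_map, bounds):
    """Slice the local window out of a 2D BEV map stored as a list of rows."""
    row0, row1, col0, col1 = bounds
    return [row[col0:col1] for row in bev_map[row0:row1]]


BEV_SIZE = 10240   # world size from the Figure 4 caption
WINDOW = 2048      # hypothetical local window size

# A fly-through trajectory moving along x at a fixed y.
trajectory = [(x, 5120) for x in range(0, BEV_SIZE, 2560)]
windows = [local_window(BEV_SIZE, WINDOW, pos) for pos in trajectory]

# Cropping is shown on a tiny map to keep the example cheap to run.
tiny_map = [[r * 8 + c for c in range(8)] for r in range(8)]
tiny_crop = crop(tiny_map, local_window(8, 4, (4, 4)))
```

Only the window contents need to be passed to the renderer for each frame, so the per-frame cost in this sketch depends on the window size rather than on the size of the whole world.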
