SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections

Zhaoxi Chen, Guangcong Wang, Ziwei Liu

TL;DR

SceneDreamer tackles unbounded 3D scene generation from in-the-wild 2D image collections without 3D annotations. It introduces a BEV representation consisting of a height field $M_h$ and a semantic field $M_s$, a semantic-aware generative neural hash grid, and a style-conditioned neural volumetric renderer that produces 3D-consistent, photorealistic renderings. The model is trained adversarially and does not require camera poses. Sliding-window inference makes 3D landscape generation efficient and scalable, and supports high-resolution rendering and perpetual view generation. Experiments show that it outperforms state-of-the-art 3D scene generators in both depth accuracy and multi-view coherence, with applications in large-scale landscape synthesis and scene interpolation.

Abstract

In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage the knowledge from 2D images. Our approach begins with an efficient bird's-eye-view (BEV) representation generated from simplex noise, which includes a height field for surface elevation and a semantic field for detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Moreover, we propose a novel generative neural hash grid to parameterize the latent space based on 3D positions and scene semantics, aiming to encode generalizable features across various scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and superiority over state-of-the-art methods in generating vivid yet diverse unbounded 3D worlds.
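The BEV representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: bilinearly interpolated lattice noise stands in for simplex noise, and the rule that derives semantic labels by thresholding height (`water_level`, `snow_level`) is a hypothetical choice made only for illustration. All function names are assumptions. The final lines show why two $N \times N$ maps scale quadratically while a dense voxel grid scales cubically.

```python
import random


def smooth_noise_grid(size, cell=8, seed=0):
    """Bilinearly interpolated lattice noise in [0, 1).

    Stand-in for simplex noise (assumption); any smooth 2D noise would do here.
    """
    rng = random.Random(seed)
    n = size // cell + 2  # lattice points needed to cover the grid
    lattice = [[rng.random() for _ in range(n)] for _ in range(n)]
    grid = []
    for y in range(size):
        gy, ty = divmod(y, cell)
        ty /= cell
        row = []
        for x in range(size):
            gx, tx = divmod(x, cell)
            tx /= cell
            top = lattice[gy][gx] * (1 - tx) + lattice[gy][gx + 1] * tx
            bot = lattice[gy + 1][gx] * (1 - tx) + lattice[gy + 1][gx + 1] * tx
            row.append(top * (1 - ty) + bot * ty)
        grid.append(row)
    return grid


def semantic_from_height(height, water_level=0.35, snow_level=0.75):
    """Hypothetical labelling rule: 0 = water, 1 = land, 2 = snow."""
    return [
        [0 if h < water_level else (2 if h > snow_level else 1) for h in row]
        for row in height
    ]


def bev_scene(size, seed=0):
    """Map a noise seed to a (height field, semantic field) pair."""
    m_h = smooth_noise_grid(size, seed=seed)
    m_s = semantic_from_height(m_h)
    return m_h, m_s


N = 64
M_h, M_s = bev_scene(N, seed=42)

bev_cells = 2 * N * N  # two N x N maps: quadratic in N
voxel_cells = N ** 3   # dense voxel grid: cubic in N
print(bev_cells, voxel_cells)
```

Because geometry lives in `M_h` and labels live in `M_s`, the two can be edited independently, which is the disentanglement the abstract refers to.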

Paper Structure (51 sections, 6 equations, 17 figures, 6 tables)

This paper contains 51 sections, 6 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: SceneDreamer learns to generate unbounded 3D scenes from in-the-wild 2D image collections. Our method can synthesize diverse landscapes across different styles, with 3D consistency, well-defined depth, and free camera trajectory.
  • Figure 2: Overview of SceneDreamer. Given a simplex noise $z \sim p_{\mathrm{scene}}$ and a style code $z_\mathrm{style} \sim p_{\mathrm{style}}$ as input, our model is capable of synthesizing large-scale 3D scenes where the camera can move freely and get realistic renderings. We first derive our BEV scene representation which consists of a height field and a semantic field. Then, we use a generative neural hash grid to parameterize the hyperspace of space-varied and scene-varied latent features given scene semantics $\bm{f}_{s}$ and 3D position $\bm{x}$. Finally, a style-modulated renderer is employed to blend latent features $\bm{f}_{\bm{x}}$ and render 2D images via volume rendering.
  • Figure 3: Diverse samples of SceneDreamer. Our model can synthesize a large variety of 3D scenes with diverse styles, from winter to summer and dawn to dusk. Please check the supplementary and project page for 3D consistent videos.
  • Figure 4: Sliding window mechanism to generate unbounded scenes beyond training resolution. Given a scene with its BEV scene representation size of $10240\times10240$, we first generate the BEV maps for the entire world, then bind the local scene window (highlighted rectangles) to the camera position (orange). Given a fly-through camera trajectory, the local scene window slides accordingly to render coherent frames (bottom).
  • Figure 5: Procedural Generation of BEV Scene Representation. The feed-forward mapping from random noise $z$ to the BEV scene representation (Sec. \ref{sec:orthogonal}), i.e., $z \rightarrow (M_h, M_s)$, can be either learned from data or parameter-free. Our instantiation starts from 2D simplex noises (highlighted in orange background). Please refer to Sec. \ref{sec:pcg} for details.
  • ...and 12 more figures
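
The sliding-window mechanism in the Figure 4 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the window is assumed to be centered on the camera, it is clamped at the world border, and the window size of 2048 is a hypothetical value. Only the world size of $10240\times10240$ comes from the caption. The function names are assumptions as well.

```python
def window_origin(cam_x, cam_y, window, world):
    """Top-left corner of a window centered on the camera.

    Clamping to the world bounds is an assumption made for this sketch.
    """
    half = window // 2
    x0 = min(max(cam_x - half, 0), world - window)
    y0 = min(max(cam_y - half, 0), world - window)
    return x0, y0


def crop(bev_map, x0, y0, window):
    """Cut the local scene window out of a global BEV map (list of rows)."""
    return [row[x0:x0 + window] for row in bev_map[y0:y0 + window]]


# Window placement on the 10240 x 10240 world from the caption.
WORLD, WINDOW = 10240, 2048  # WINDOW is a hypothetical value
center = window_origin(5000, 5000, WINDOW, WORLD)
corner = window_origin(100, 100, WINDOW, WORLD)

# Small toy world so the crop itself stays cheap to run.
toy_world = 64
toy_map = [[y * toy_world + x for x in range(toy_world)] for y in range(toy_world)]
trajectory = [(8, 32), (24, 32), (40, 32), (56, 32)]  # fly-through along x
frames = [
    crop(toy_map, *window_origin(cx, cy, 16, toy_world), 16)
    for cx, cy in trajectory
]
print(center, corner, len(frames))
```

Every frame reads from the same global map, so regions shared by consecutive windows are identical, which is what keeps a fly-through coherent.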