Unconstrained Scene Generation with Locally Conditioned Radiance Fields
Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W. Taylor, Joshua M. Susskind
TL;DR
Generative Scene Networks (GSN) address the lack of a scene prior in radiance-field rendering by decomposing unconstrained indoor scenes into a grid of locally conditioned radiance fields that render via volumetric rendering from a freely moving camera. A Global Generator produces a 2D latent grid $W$ that spatially organizes local latents, each conditioning a NeRF-like radiance field $f$ to model $\sigma$ and $\mathbf{a}$ in local coordinates; rendering integrates along rays and uses a refinement network to generate high-quality images. Across VizDoom, Replica, and AVD, GSN achieves substantial improvements in $\text{FID}$ and $\text{SwAV-FID}$ over prior generative radiance-field models, and ablations confirm the benefits of local conditioning, local coordinates, sufficient trajectory length, depth usage, and discriminative regularization. The model supports unconditional sampling, conditional scene completion via inversion, and coherent view synthesis, with implications for world models, SLAM, AR/VR, and 3D content creation.
Abstract
We tackle the challenge of learning a distribution over complex, realistic, indoor scenes. In this paper, we introduce Generative Scene Networks (GSN), which learns to decompose scenes into a collection of many local radiance fields that can be rendered from a free moving camera. Our model can be used as a prior to generate new scenes, or to complete a scene given only sparse 2D observations. Recent work has shown that generative models of radiance fields can capture properties such as multi-view consistency and view-dependent lighting. However, these models are specialized for constrained viewing of single objects, such as cars or faces. Due to the size and complexity of realistic indoor environments, existing models lack the representational capacity to adequately capture them. Our decomposition scheme scales to larger and more complex scenes while preserving details and diversity, and the learned prior enables high-quality rendering from viewpoints that are significantly different from observed viewpoints. When compared to existing models, GSN produces quantitatively higher-quality scene renderings across several different scene datasets.
