Unconstrained Scene Generation with Locally Conditioned Radiance Fields

Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W. Taylor, Joshua M. Susskind

TL;DR

Generative Scene Networks (GSN) address the lack of a scene prior in radiance-field rendering by decomposing unconstrained indoor scenes into a grid of locally conditioned radiance fields that are volume-rendered from a freely moving camera. A global generator produces a 2D latent grid $\mathbf{W}$ that spatially organizes local latent codes, each of which conditions a NeRF-like radiance field $f$ that models density $\sigma$ and appearance $\mathbf{a}$ in local coordinates. Rendering integrates these quantities along camera rays, and a refinement network turns the result into the final high-quality image. On VizDoom, Replica, and AVD, GSN achieves substantially better FID and SwAV-FID than prior generative radiance-field models, and ablations confirm the benefits of local conditioning, local coordinates, sufficient trajectory length, depth usage, and discriminator regularization. The model supports unconditional sampling, conditional scene completion via inversion, and coherent view synthesis, with implications for world models, SLAM, AR/VR, and 3D content creation.
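The "integrates along rays" step can be illustrated with the standard NeRF-style quadrature for volumetric rendering. The sketch below is a minimal NumPy version for a single ray and is not taken from the GSN code: the function name `composite_along_ray`, the argument layout, and the choice to reuse the last sample spacing for the final interval are all assumptions made for illustration.

```python
import numpy as np

def composite_along_ray(sigma, feat, t):
    """Alpha-composite samples along one ray (hypothetical helper).

    sigma: (N,) non-negative densities at the N samples (N >= 2)
    feat:  (N, C) appearance vectors at the samples
    t:     (N,) increasing sample depths along the ray
    """
    sigma = np.asarray(sigma, dtype=float)
    feat = np.asarray(feat, dtype=float)
    t = np.asarray(t, dtype=float)

    # Spacing between adjacent samples; the last interval reuses the
    # previous spacing (an assumption, other conventions exist).
    delta = np.diff(t)
    delta = np.append(delta, delta[-1])

    # Opacity contributed by each interval.
    alpha = 1.0 - np.exp(-sigma * delta)

    # Transmittance: fraction of the ray that reaches sample i unabsorbed.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))

    weights = trans * alpha
    rendered = weights @ feat   # composited appearance, shape (C,)
    depth = weights @ t         # expected ray termination depth
    return rendered, depth, weights
```

With zero density everywhere, every weight is zero and nothing is rendered; with a very dense first sample, the first weight is one and the output equals that sample's appearance vector. In GSN the composited result is then passed to the refinement network, which this sketch does not model.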

Abstract

We tackle the challenge of learning a distribution over complex, realistic, indoor scenes. In this paper, we introduce Generative Scene Networks (GSN), which learns to decompose scenes into a collection of many local radiance fields that can be rendered from a freely moving camera. Our model can be used as a prior to generate new scenes, or to complete a scene given only sparse 2D observations. Recent work has shown that generative models of radiance fields can capture properties such as multi-view consistency and view-dependent lighting. However, these models are specialized for constrained viewing of single objects, such as cars or faces. Due to the size and complexity of realistic indoor environments, existing models lack the representational capacity to adequately capture them. Our decomposition scheme scales to larger and more complex scenes while preserving details and diversity, and the learned prior enables high-quality rendering from viewpoints that are significantly different from observed viewpoints. When compared to existing models, GSN produces quantitatively higher-quality scene renderings across several different scene datasets.

Paper Structure

This paper contains 23 sections, 3 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: Scenes sampled from our learned prior, rendered from freely moving camera paths containing various rotation, translation, and forward/backward motions.
  • Figure 2: Architecture of the GSN generator. We sample a latent code $\textbf{z} \sim p_z$ that is fed to our global generator $g$, producing a local latent grid $\textbf{W}$. This local latent grid $\textbf{W}$ conceptually represents a latent scene "floorplan" and is used for locally conditioning a radiance field $f$, from which images are rendered via volumetric rendering. For a given point $\textbf{p}$ to be rendered, expressed in a global coordinate system, we sample $\textbf{W}$ at the location $(i, j)$ given by $\textbf{p}$, resulting in $\textbf{w}_{ij}$. In turn, $f$ takes as input $\textbf{p}'$, which results from expressing $\textbf{p}$ relative to the local coordinate system of $\textbf{w}_{ij}$.
  • Figure 3: (a) Architecture of global generator $g$. We use a mapping network, modulated convolutional blocks, and a learned constant input as in StyleGAN2. (b) Architecture of the locally conditioned radiance field network $f$. Latent code $\textbf{w}_{ij}$, sampled from $\textbf{W}$, is used to modulate linear layers, similar to CIPS.
  • Figure 4: Random trajectories through scenes generated by GSN. Models are trained on VizDoom (left) and Replica (right) at $64\times64$ resolution. We omit qualitative results for AVD due to unclear licensing terms regarding reproduction of figures for this dataset.
  • Figure 5: Two example latent interpolations between global latent codes $\textbf{z}$. Scenes transition smoothly by aligning geometry features such as walls (top) and appearance features such as the picture frame and doorway (bottom). Views are rendered from a fixed camera pose.
  • ...and 13 more figures
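
The lookup described in the Figure 2 caption (sample $\textbf{W}$ at the cell $(i, j)$ given by $\textbf{p}$, then express $\textbf{p}$ relative to that cell) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper name `sample_local_latent`, the square floor area of side `extent` centered at the origin, the choice of $y$ as the vertical axis, and the nearest-cell lookup are all hypothetical, and the actual model may interpolate between neighboring cells instead.

```python
import numpy as np

def sample_local_latent(W, p, extent):
    """Look up the local latent for a global point (hypothetical helper).

    W:      (S, S, C) local latent grid, the scene "floorplan"
    p:      (3,) global point (x, y, z); y is assumed to be vertical
    extent: side length of the square floor area covered by W,
            assumed to be centered at the origin
    """
    W = np.asarray(W)
    p = np.asarray(p, dtype=float)
    S = W.shape[0]
    cell = extent / S  # side length of one grid cell

    # Continuous grid coordinates of the point's horizontal position.
    u = (p[0] + extent / 2) / cell
    v = (p[2] + extent / 2) / cell

    # Nearest-cell lookup (assumption), clamped to the grid bounds.
    i = int(np.clip(np.floor(u), 0, S - 1))
    j = int(np.clip(np.floor(v), 0, S - 1))
    w_ij = W[i, j]

    # Express p relative to the center of cell (i, j); height is unchanged.
    center = np.array([(i + 0.5) * cell - extent / 2,
                       0.0,
                       (j + 0.5) * cell - extent / 2])
    p_local = p - center
    return (i, j), w_ij, p_local
```

For example, with a 4×4 grid covering an 8×8 area, the point `(0.5, 1.0, -3.5)` falls in cell `(2, 0)`, whose center is `(1.0, 0.0, -3.0)`, so the local point is `(-0.5, 1.0, -0.5)`. The radiance field $f$ would then receive this local point together with $\textbf{w}_{ij}$.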