Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

Vincent Sitzmann, Michael Zollhöfer, Gordon Wetzstein

TL;DR

SRNs address the challenge of learning 3D scene structure from 2D images by introducing a continuous, 3D-aware scene representation $\Phi$ coupled with a neural renderer $\Theta$ based on differentiable ray marching. This framework enforces 3D structure and multi-view consistency while enabling high-resolution rendering without explicit 3D supervision, and it generalizes across scenes via latent codes and a hypernetwork. The authors demonstrate novel view synthesis, few-shot reconstruction, and latent-space interpolation on Shepard-Metzler and ShapeNet v2, as well as unsupervised discovery of non-rigid facial geometry and room-scale scene modeling. The work advances 3D-structure-aware neural representations by combining 3D geometry with learnable appearance, enabling scalable 3D vision and graphics from 2D supervision.

Abstract

Unsupervised learning with generative models has the potential of discovering rich representations of 3D scenes. While geometric deep learning has explored 3D-structure-aware representations of scene geometry, these models typically require explicit 3D supervision. Emerging neural scene representations can be trained only with posed 2D images, but existing methods ignore the three-dimensional structure of scenes. We propose Scene Representation Networks (SRNs), a continuous, 3D-structure-aware scene representation that encodes both geometry and appearance. SRNs represent scenes as continuous functions that map world coordinates to a feature representation of local scene properties. By formulating the image formation as a differentiable ray-marching algorithm, SRNs can be trained end-to-end from only 2D images and their camera poses, without access to depth or shape. This formulation naturally generalizes across scenes, learning powerful geometry and appearance priors in the process. We demonstrate the potential of SRNs by evaluating them for novel view synthesis, few-shot reconstruction, joint shape and appearance interpolation, and unsupervised discovery of a non-rigid face model.
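
To make the pipeline described in the abstract concrete, the following is a minimal sketch of its structure: a function that maps world coordinates to a feature vector, a marcher that steps along a camera ray using those features, and a generator that turns the final feature into a color. All three functions here (`phi`, `step_length`, `pixel_generator`) are hand-written, hypothetical stand-ins; in SRNs they are learned networks, and the step length is predicted rather than read off a known distance field. The sketch uses plain Python for readability, so it is not itself differentiable.

```python
import math


def phi(point):
    """Stand-in for the scene representation: (x, y, z) -> feature vector.

    ASSUMPTION: the feature is hand-written as
    [signed distance to a unit sphere at the origin, r, g, b].
    In SRNs this mapping is a learned network.
    """
    x, y, z = point
    dist = math.sqrt(x * x + y * y + z * z) - 1.0
    return [dist, 0.8, 0.2, 0.2]


def step_length(feature):
    """Stand-in for the learned ray marcher: feature -> step length.

    ASSUMPTION: we simply read the signed distance stored in the feature.
    """
    return feature[0]


def pixel_generator(feature):
    """Stand-in for the pixel generator: feature -> RGB color."""
    return tuple(feature[1:4])


def render_ray(origin, direction, num_steps=10, initial_depth=0.05):
    """March along one camera ray and return (rgb, depth)."""
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]

    depth = initial_depth
    for _ in range(num_steps):
        # Current world-space position on the ray.
        point = [o + depth * d for o, d in zip(origin, direction)]
        # Query the scene at that point and advance by the predicted step.
        depth += step_length(phi(point))

    # Feature at the estimated surface point determines the pixel color.
    point = [o + depth * d for o, d in zip(origin, direction)]
    return pixel_generator(phi(point)), depth


rgb, depth = render_ray(origin=(0.0, 0.0, -3.0), direction=(0.0, 0.0, 1.0))
```

For a camera at z = -3 looking at the unit sphere, the marcher converges to a depth of about 2, the distance to the sphere surface. Written in an automatic-differentiation framework, every operation in this loop would be differentiable, which is what allows training from posed 2D images alone.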

Paper Structure

This paper contains 28 sections, 7 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Overview: at the heart of SRNs lies a continuous, 3D-aware neural scene representation, $\Phi$, which represents a scene as a function that maps $(x,y,z)$ world coordinates to a feature representation of the scene at those coordinates (see Sec. \ref{subsec:representation}). A neural renderer $\Theta$, consisting of a learned ray marcher and a pixel generator, can render the scene from arbitrary novel viewpoints (see Sec. \ref{subsec:rendering}).
  • Figure 2: Shepard-Metzler object from the 1k-object training set, with 15 observations each. SRNs (right) outperform dGQN (left) on this small dataset.
  • Figure 3: Non-rigid animation of a face. Note that mouth movement is directly reflected in the normal maps.
  • Figure 4: Normal maps for a selection of objects. We note that geometry is learned fully unsupervised and arises purely out of the perspective and multi-view geometry constraints on the image formation.
  • Figure 5: Interpolating latent code vectors of cars and chairs in the ShapeNet dataset while rotating the camera around the model. Features smoothly transition from one model to another.
  • ...and 3 more figures
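
The latent-code interpolation shown in Figure 5 can be sketched as follows. This is an illustrative assumption rather than the authors' code: the caption does not state the interpolation scheme, so plain linear interpolation is used here, and `interpolate_latents` is a hypothetical helper name. Each intermediate code would then be passed to the hypernetwork to obtain the scene representation that is rendered.

```python
def interpolate_latents(z_a, z_b, num_steps):
    """Return num_steps latent codes moving linearly from z_a to z_b.

    ASSUMPTION: linear interpolation; the source does not specify the scheme.
    """
    if len(z_a) != len(z_b):
        raise ValueError("latent codes must have the same dimension")
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")

    codes = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # t runs from 0.0 (z_a) to 1.0 (z_b)
        codes.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return codes


# Toy 2-dimensional latent codes standing in for two objects.
path = interpolate_latents([0.0, 2.0], [1.0, 0.0], num_steps=3)
```

The first and last entries of `path` equal the two input codes, and the middle entry is their midpoint.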