Neural Groundplans: Persistent Neural Scene Representations from a Single Image

Prafull Sharma, Ayush Tewari, Yilun Du, Sergey Zakharov, Rares Ambrus, Adrien Gaidon, William T. Freeman, Fredo Durand, Joshua B. Tenenbaum, Vincent Sitzmann

TL;DR

We address the problem of building a persistent 3D scene representation from a small number of images while disentangling static background from movable objects. The proposed approach introduces conditional neural groundplans, which are ground-aligned 2D feature grids enabling 3D queries and differentiable rendering, trained in a self-supervised manner from unlabeled multi-view videos. A nonlinear compactification enables unbounded scenes to be represented within bounded groundplans, and multi-view supervision enables learning static-dynamic disentanglement that supports single-image 3D reconstruction, instance-level segmentation, 3D bounding boxes, and scene editing. This framework provides a data-efficient backbone for 3D scene understanding with promising downstream capabilities for completion, object discovery, and interactive editing.

Abstract

We present a method to map 2D image observations of a scene to a persistent 3D scene representation, enabling novel view synthesis and disentangled representation of the movable and immovable components of the scene. Motivated by the bird's-eye-view (BEV) representation commonly used in vision and robotics, we propose conditional neural groundplans, ground-aligned 2D feature grids, as persistent and memory-efficient scene representations. Our method is trained self-supervised from unlabeled multi-view observations using differentiable rendering, and learns to complete geometry and appearance of occluded regions. In addition, we show that we can leverage multi-view videos at training time to learn to separately reconstruct static and movable components of the scene from a single image at test time. The ability to separately reconstruct movable objects enables a variety of downstream tasks using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance-level segmentation, 3D bounding box prediction, and scene editing. This highlights the value of neural groundplans as a backbone for efficient 3D scene understanding models.
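As a rough illustration of how a ground-aligned 2D feature grid can answer a 3D query, the sketch below bilinearly samples the grid at a point's ground coordinates and returns the point's height alongside the feature. The function name `query_groundplan`, the choice of y as the up axis, the square `extent`, and the clamping at the grid border are all assumptions made for this example and are not taken from the paper; the decoder that would consume the result is not shown.

```python
from typing import List, Sequence, Tuple

# grid[row][col] is a feature vector; rows index z, columns index x (assumed layout).
Grid = List[List[List[float]]]


def query_groundplan(
    grid: Grid, point: Sequence[float], extent: float
) -> Tuple[List[float], float]:
    """Bilinearly sample a ground-aligned feature grid at a 3D point.

    Assumes y is the up axis and the grid spans [-extent, extent] in x and z.
    Returns the interpolated feature and the point's height.
    """
    x, y, z = point
    rows, cols = len(grid), len(grid[0])

    # Map world coordinates to continuous grid indices, clamped to the grid.
    u = min(max((x + extent) / (2 * extent), 0.0), 1.0) * (cols - 1)
    v = min(max((z + extent) / (2 * extent), 0.0), 1.0) * (rows - 1)

    # Indices of the four surrounding vertices.
    c0 = min(int(u), max(cols - 2, 0))
    r0 = min(int(v), max(rows - 2, 0))
    c1 = min(c0 + 1, cols - 1)
    r1 = min(r0 + 1, rows - 1)
    fu, fv = u - c0, v - r0

    dim = len(grid[0][0])
    feature = []
    for d in range(dim):
        top = (1 - fu) * grid[r0][c0][d] + fu * grid[r0][c1][d]
        bottom = (1 - fu) * grid[r1][c0][d] + fu * grid[r1][c1][d]
        feature.append((1 - fv) * top + fv * bottom)
    return feature, y


# Usage: a 2x2 grid of 1-dimensional features covering [-1, 1] in x and z.
toy_grid = [[[0.0], [1.0]], [[2.0], [3.0]]]
center_feature, center_height = query_groundplan(toy_grid, (0.0, 0.5, 0.0), extent=1.0)
```

Because every point in a vertical column reads the same grid cell, storage grows with the ground area rather than with the scene volume, which is consistent with the memory efficiency claimed above.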

Paper Structure

This paper contains 48 sections, 1 equation, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Given a single image, our model infers separate 3D representations for static and dynamic scene elements, enabling high-quality novel view synthesis with plausible completion, unsupervised instance-level segmentation, 3D bounding box prediction, 3D scene editing, and extraction of object-centric 3D representations. Our model is trained self-supervised using unlabeled multi-view videos.
  • Figure 2: Groundplan inference. Given a context image, we first extract a set of CNN features. We unproject the features into 3D and re-sample them at "pillars" on top of the location of groundplan vertices. Pillars are aggregated into groundplan features using a softmax-weighted sum. The resulting 2D grid of features is decomposed into separate dynamic and static groundplans by a 2D CNN. The coordinate-encoding MLP is not visualized in this figure. Please refer to Sec. \ref{sec:ground_planes} for details.
  • Figure 3: Learning Static-Dynamic Disentanglement. Given multiple frames of a video, we extract per-frame, compactified, static and dynamic groundplans according to Fig. 2. The static groundplans are pooled into a single time-invariant groundplan. We then composite each per-frame dynamic groundplan with the time-invariant static groundplan via differentiable volume rendering. Our model is supervised only via a re-rendering loss on video frames. We encourage the model to explain as much of the scene density as possible with the static groundplan via a sparsity loss on per-frame dynamic volume rendering densities. The surface loss is not visualized here.
  • Figure 4: Qualitative comparisons. Comparison for novel-view synthesis given a single context view with PixelNeRF [yu2020pixelnerf] and uORF [yu2021unsuperviseduorf].
  • Figure 5: Single-image reconstruction, disentanglement of static and dynamic objects, and novel view synthesis. Given a single input image, our method can disentangle the observed scene into static and object components based on what the model observed as stationary and as moving in the training data. In these examples, the cars are isolated in the object component because the model was trained on video data of cars moving on the road.
  • ...and 7 more figures
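
The captions for Figures 2 and 3 mention three operations that a small sketch may make more concrete: the softmax-weighted sum over a pillar, the compositing of static and dynamic contributions, and the sparsity penalty on dynamic densities. The code below is only an illustration under stated assumptions. The per-sample `scores` fed to the softmax, the rule of adding densities and blending colors by density, and the mean-absolute-value form of the sparsity loss are all hypothetical choices made here; the captions do not specify them.

```python
import math
from typing import List, Sequence, Tuple


def aggregate_pillar(
    features: Sequence[Sequence[float]], scores: Sequence[float]
) -> List[float]:
    """Collapse features sampled along one vertical pillar into one vector.

    `scores` holds one scalar per sample (assumed to come from a learned
    predictor); a softmax turns them into weights that sum to 1.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def composite(
    sigma_static: float,
    rgb_static: Sequence[float],
    sigma_dynamic: float,
    rgb_dynamic: Sequence[float],
) -> Tuple[float, List[float]]:
    """Combine static and dynamic predictions at a single 3D sample.

    Assumed rule: densities add, colors are averaged weighted by density.
    """
    sigma = sigma_static + sigma_dynamic
    if sigma == 0.0:
        return 0.0, [0.0 for _ in rgb_static]
    rgb = [
        (sigma_static * s + sigma_dynamic * d) / sigma
        for s, d in zip(rgb_static, rgb_dynamic)
    ]
    return sigma, rgb


def dynamic_sparsity_loss(dynamic_sigmas: Sequence[float]) -> float:
    """Mean absolute dynamic density (an assumed L1-style penalty)."""
    return sum(abs(s) for s in dynamic_sigmas) / len(dynamic_sigmas)


# Usage: equal scores give a plain average over the pillar samples.
pillar_feature = aggregate_pillar([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
sigma, rgb = composite(1.0, [1.0, 0.0, 0.0], 3.0, [0.0, 0.0, 1.0])
penalty = dynamic_sparsity_loss([0.0, 2.0])
```

Penalizing dynamic density in this way pushes the model to explain a region with the static branch whenever the re-rendering loss allows it, which matches the intent stated in the Figure 3 caption.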