SSR-2D: Semantic 3D Scene Reconstruction from 2D Images

Junwen Huang, Alexey Artemov, Yujin Chen, Shuaifeng Zhi, Kai Xu, Matthias Nießner

TL;DR

SSR-2D tackles semantic scene reconstruction from incomplete RGB-D data without 3D annotations by leveraging 2D supervision and differentiable rendering. The method fuses incomplete TSDF geometry with learned color and semantic predictions in a 3D U-Net, and supervises via differentiable rendering of depth, color, and semantics from both real and virtual views. A key novelty is pseudo-supervision from a generic 2D semantic predictor and a self-supervised loop with virtual views, enabling end-to-end training that achieves state-of-the-art semantic scene completion on Matterport3D and ScanNet. This 2D-driven approach reduces reliance on expensive 3D labels while maintaining competitive 3D completion and semantic accuracy, with practical impact for robotics, AR, and large-scale scene understanding.

Abstract

Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations. The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics using only 2D labels, which can be either manual or machine-generated. Our key technical innovation is to leverage differentiable rendering of color and semantics to bridge 2D observations and unknown 3D space, using the observed RGB images and 2D semantics as supervision, respectively. We additionally develop a learning pipeline and a corresponding method for learning from imperfect predicted 2D labels, which can additionally be acquired by synthesizing an augmented set of virtual training views that complement the original real captures, enabling a more efficient self-supervision loop for semantics. As a result, our end-to-end trainable solution jointly addresses geometry completion, colorization, and semantic mapping from limited RGB-D images, without relying on any 3D ground-truth information. Our method achieves state-of-the-art performance in semantic scene completion on two large-scale benchmark datasets, Matterport3D and ScanNet, and surpasses baselines, even those trained with costly 3D annotations, in predicting both geometry and semantics. To our knowledge, our method is also the first 2D-driven method to address completion and semantic segmentation of real-world 3D scans simultaneously.

Paper Structure (19 sections, 5 equations, 10 figures, 9 tables)


Figures (10)

  • Figure 1: Starting with sparse RGB-D images, our SSR-2D jointly predicts complete geometry, appearance, and semantic labels from incomplete real-world scans, without access to any in-place 3D ground-truth annotations during training. Instead, we rely on 2D RGB images and their semantic segmentations obtained from a pre-trained semantic predictor.
  • Figure 2: Our method accepts a fused but incomplete TSDF reconstruction as input and convolves it with a 3D encoder-decoder CNN, jointly producing complete 3D geometry, colorization, and semantic segmentation. In the general case (Section \ref{method:supervision}), we generate 2D depth, color, and semantic images using either the original viewpoints $U$, or arbitrary virtual viewpoints $V$, by a differentiable rendering technique. These synthesized views are used to supervise training w.r.t. the original RGB-D images in a pseudo-supervised training loop, or w.r.t. multi-view consistency in a self-supervised training loop.
  • Figure 3: Our 3D CNN architecture comprises two encoders for geometry and color, and three decoders for completed geometry, semantics, and color, respectively.
  • Figure 4: To train our approach with pseudo-GT labels, we optimize: (a) for the original views, deviations between segmented RGBs $\phi_{\text{seg}}(I_u)$ and same-view semantic renderings $\mathcal{R}(\widehat{s}; u)$; (b) for virtually sampled views, deviations between semantic predictions of RGB renderings $\phi_{\text{seg}}(\mathcal{R}(\widehat{c}; v))$ and direct semantic renderings $\mathcal{R}(\widehat{s}; v)$ in the same view.
  • Figure 5: Example virtual views synthesized for the Matterport3D dataset. Virtual 2D view selection enables incorporating richer context information of underlying 3D scenes into the renderings for training.
  • ...and 5 more figures
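
The two training terms described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the caption only speaks of "deviations", so per-pixel cross-entropy is an assumption here, and `toy_phi_seg` is a hypothetical stand-in for the pre-trained 2D predictor $\phi_{\text{seg}}$. The rendered maps $\mathcal{R}(\widehat{s}; \cdot)$ and $\mathcal{R}(\widehat{c}; \cdot)$ are passed in as plain arrays rather than produced by a differentiable renderer.

```python
import numpy as np

def cross_entropy(pred_probs, target_labels, eps=1e-8):
    """Mean per-pixel cross-entropy (assumed deviation measure).

    pred_probs:    (H, W, C) class probabilities, e.g. a rendered semantic map.
    target_labels: (H, W) integer pseudo-labels.
    """
    # Pick the probability assigned to the target class at each pixel.
    picked = np.take_along_axis(pred_probs, target_labels[..., None], axis=-1)[..., 0]
    return float(-np.log(picked + eps).mean())

def pseudo_supervised_loss(render_sem_u, image_u, phi_seg):
    """Term (a): R(s_hat; u) versus phi_seg(I_u) on an original view u."""
    return cross_entropy(render_sem_u, phi_seg(image_u))

def self_supervised_loss(render_sem_v, render_rgb_v, phi_seg):
    """Term (b): R(s_hat; v) versus phi_seg(R(c_hat; v)) on a virtual view v."""
    return cross_entropy(render_sem_v, phi_seg(render_rgb_v))

def toy_phi_seg(rgb):
    """Hypothetical 2-class predictor: thresholds the red channel."""
    return (rgb[..., 0] > 0.5).astype(np.int64)

# Toy data standing in for a real image and for rendered outputs.
rng = np.random.default_rng(0)
image_u = rng.random((4, 4, 3))           # plays the role of I_u or R(c_hat; v)
labels = toy_phi_seg(image_u)
agree = np.eye(2)[labels]                 # semantic rendering matching the pseudo-labels
disagree = np.eye(2)[1 - labels]          # semantic rendering contradicting them

loss_agree = pseudo_supervised_loss(agree, image_u, toy_phi_seg)
loss_disagree = pseudo_supervised_loss(disagree, image_u, toy_phi_seg)
loss_virtual = self_supervised_loss(agree, image_u, toy_phi_seg)
```

In this toy setup a semantic rendering that matches the pseudo-labels gives a loss near zero, while a contradicting one is penalized heavily. The only difference between the two terms is where the RGB input to the 2D predictor comes from: a captured image for original views, a rendered image for virtual views.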