Semantic Scene Completion from a Single Depth Image

Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, Thomas Funkhouser

TL;DR

This work tackles semantic scene completion from a single depth image by jointly predicting volumetric occupancy and object category labels for voxels in the camera view frustum. It introduces SSCNet, a 3D ConvNet that uses a dilated 3D context module and multi-scale fusion to capture large-scale context, trained on the large synthetic SUNCG dataset with dense voxel labels. The results show that combining occupancy and semantic supervision, together with synthetic training data and a TSDF encoding with reduced view dependency, improves on task-specific baselines, and that a larger receptive field and multi-scale aggregation each improve performance. The approach is relevant to robotics and 3D reconstruction tasks that require complete scene semantics from partial observations.

Abstract

This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. Previous work has considered scene completion and semantic labeling of depth maps separately. However, we observe that these two problems are tightly intertwined. To leverage the coupled nature of these two tasks, we introduce the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum. Our network uses a dilation-based 3D context module to efficiently expand the receptive field and enable 3D context learning. To train our network, we construct SUNCG, a manually created large-scale dataset of synthetic 3D scenes with dense volumetric annotations. Our experiments demonstrate that the joint model outperforms methods addressing each task in isolation and outperforms alternative approaches on the semantic scene completion task.
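
To illustrate why dilation expands the receptive field efficiently, here is a minimal sketch that computes the receptive field along one axis for a stack of convolutions. The layer lists `plain` and `dilated` are hypothetical examples chosen for illustration; they are not the SSCNet configuration, and `receptive_field` is a helper name introduced here.

```python
def receptive_field(layers):
    """Receptive field (in input voxels, along one axis) of stacked convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, first layer first.
    """
    rf = 1    # a single output unit initially covers one input voxel
    jump = 1  # spacing, in input voxels, between adjacent units of the current layer
    for kernel, stride, dilation in layers:
        # A dilated kernel spans (kernel - 1) * dilation steps of the current spacing.
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical stacks with the same number of 3-wide layers (same parameter count).
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 2), (3, 1, 2)]

print(receptive_field(plain))    # 9
print(receptive_field(dilated))  # 15
```

Both stacks have the same number of weights per layer, but the dilated one sees a wider span per axis, which is the effect the context module relies on.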

Paper Structure

This paper contains 42 sections, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Semantic scene completion. (a) Input single-view depth map (b) Visible surface from the depth map; color is for visualization only. (c) Semantic scene completion result: our model jointly predicts volumetric occupancy and object categories for each of the 3D voxels in the view frustum. Note that the entire volume occupied by the bed is predicted to have the bed category.
  • Figure 2: Given a single-view depth observation of a 3D scene, the goal of our SSCNet is to predict both occupancy and object category for the voxels on the observed surface and in occluded regions.
  • Figure 3: SSCNet: Semantic scene completion network. Taking a single depth map as input, the network predicts occupancy and object labels for each voxel in the view frustum. The convolution parameters are shown as (number of filters, kernel size, stride, dilation).
  • Figure 4: Comparison of receptive fields and voxel sizes between SSCNet and prior work. (a) Object-centric networks such as 3DShapeNets and VoxNet scale objects into the same 3D voxel grid, thus discarding physical size information. In (b)-(d), colored regions indicate the effective receptive field of a single neuron in the last layer of each 3D ConvNet. With the help of 3D dilated convolution, SSCNet drastically increases its receptive field compared to other 3D ConvNet architectures (DSS, 3DMatch), thus capturing richer 3D contextual information.
  • Figure 5: Different encodings for surface (a). The projective TSDF (b) is computed with respect to the camera and is therefore view-dependent. The accurate TSDF (c) has less view dependency but exhibits strong gradients in empty space along the occlusion boundary (circled in gray). In contrast, the flipped TSDF (d) has the strongest gradient near the surface.
  • ...and 10 more figures
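
The encodings contrasted in Figure 5 can be sketched in one dimension. The snippet below assumes the flipped TSDF takes the form sign(d) · (d_max − |d|), where d is the signed distance to the nearest surface and d_max is the truncation distance. The captions above do not state this formula, so treat it, the function names, and the choice to count the surface itself (d = 0) as positive as assumptions made for illustration.

```python
def truncate(distance, d_max):
    """Clamp a signed distance to the range [-d_max, d_max]."""
    return max(-d_max, min(d_max, distance))


def flipped_tsdf(distance, d_max):
    """Flipped TSDF value for one voxel (assumed form: sign(d) * (d_max - |d|)).

    The magnitude is largest at the surface and falls to zero at the
    truncation distance, so the sharpest change sits next to the surface.
    """
    d = truncate(distance, d_max)
    sign = 1.0 if d >= 0 else -1.0  # assumption: a voxel on the surface counts as positive
    return sign * (d_max - abs(d))


d_max = 1.0
for distance in (-2.0, -0.25, 0.0, 0.25, 2.0):
    print(distance, truncate(distance, d_max), flipped_tsdf(distance, d_max))
```

With a standard truncated distance, voxels far from any surface saturate at ±d_max and the value near the surface is close to zero; the flipped form inverts this, so far-away voxels map to zero and the surface carries the largest magnitude.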