OccDepth: A Depth-Aware Method for 3D Semantic Scene Completion

Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, Shuchang Zhou

TL;DR

OccDepth addresses 3D Semantic Scene Completion from stereo images by introducing depth-aware fusion and occupancy priors. The method combines Stereo Soft Feature Assignment to lift stereo 2D features into 3D, and an Occupancy Aware Depth module that injects depth priors via a learned depth distribution with discretization and distillation supervision ($L_{depth}$) from a dense stereo depth model. Empirical results show substantial gains over RGB-only baselines (e.g., $+4.82$ mIoU on SemanticKITTI) and competitiveness with 2.5D/3D-input SSC methods, while sketching a path for robust indoor evaluation via SemanticTartanAir. The approach enables accurate 3D completion using cheaper image sensors, with practical impact for autonomous driving and robotics.
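The depth discretization and distillation supervision mentioned above can be illustrated with a minimal sketch. It rests on assumptions the summary does not confirm: uniform depth bins, a cross-entropy form for $L_{depth}$ against a one-hot teacher bin, and the hypothetical names `depth_to_bin` and `depth_distillation_loss` with illustrative depth limits. The summary only states that a learned depth distribution is supervised by a dense stereo depth model.

```python
import math


def depth_to_bin(depth, d_min, d_max, num_bins):
    """Map a metric depth to a bin index (uniform binning is an assumption)."""
    depth = min(max(depth, d_min), d_max)  # clamp into the valid range
    width = (d_max - d_min) / num_bins
    return min(int((depth - d_min) / width), num_bins - 1)


def softmax(logits):
    """Numerically stable softmax over a list of per-bin logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def depth_distillation_loss(student_logits, teacher_depth, d_min=1.0, d_max=50.0):
    """Cross-entropy between the student's depth distribution for one pixel
    and the one-hot bin of the teacher's depth (assumed loss form)."""
    probs = softmax(student_logits)
    target = depth_to_bin(teacher_depth, d_min, d_max, len(student_logits))
    return -math.log(probs[target] + 1e-12)
```

Under these assumptions, a student whose distribution peaks in the teacher's bin receives a lower loss than one with a flat distribution, which is the behavior a distillation signal of this kind is meant to encourage.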

Abstract

3D Semantic Scene Completion (SSC) can provide dense geometric and semantic scene representations, which can be applied in autonomous driving and robotic systems. It is challenging to estimate the complete geometry and semantics of a scene solely from visual images, and accurate depth information is crucial for restoring 3D geometry. In this paper, we propose the first stereo SSC method, named OccDepth, which fully exploits implicit depth information from stereo images (or RGBD images) to help recover 3D geometric structures. The Stereo Soft Feature Assignment (Stereo-SFA) module is proposed to better fuse 3D depth-aware features by implicitly learning the correlation between stereo images. In particular, when the input is an RGBD image, virtual stereo images can be generated from the original RGB image and depth map. In addition, the Occupancy Aware Depth (OAD) module is used to obtain geometry-aware 3D features by knowledge distillation using pre-trained depth models. Furthermore, a reformed TartanAir benchmark, named SemanticTartanAir, is provided in this paper for further testing our OccDepth method on the SSC task. Compared with the state-of-the-art RGB-inferred SSC method, extensive experiments on SemanticKITTI show that our OccDepth method achieves superior performance, improving mIoU by +4.82%, of which +2.49% comes from stereo images and +2.33% comes from our proposed depth-aware method. Our code and trained models are available at https://github.com/megvii-research/OccDepth.

Paper Structure

This paper contains 21 sections, 10 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: RGB-based Semantic Scene Completion with and without depth awareness. (a) Our proposed OccDepth method can detect smaller and farther objects. (b) Our proposed OccDepth method completes the road better.
  • Figure 2: The processing pipeline of the proposed OccDepth. The 3D SSC is inferred from stereo images using a Stereo-SFA module to lift features to 3D space, an OAD module to enhance depth prediction, and a 3D U-Net to extract geometry and semantics. The stereo depth network is used only during training to provide depth supervision.
  • Figure 3: Illustration of the stereo soft feature assignment module. The sampled 2D features are fused into a 3D voxel feature.
  • Figure 4: Illustration of the occupancy aware depth module. For simplicity, only the processing flow of single-shot $V_{D}$ is shown. The OAD module introduces a spatial occupancy prior from the predicted depth information.
  • Figure 5: Qualitative study on SemanticTartanAir and SemanticKITTI. The input is shown on the far left and the ground truth on the far right. OccDepth captures better scene layout on both datasets.
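The fusion step described in the Figure 3 caption can be sketched for a single voxel as follows. This is one plausible reading, not the authors' implementation. It assumes that the features sampled from the left and right images are averaged and then scaled by their cosine similarity, clamped at zero, so that voxels whose two views disagree contribute weaker features. The function names are hypothetical, and the step that projects a voxel into each image to sample its features is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_stereo_features(f_left, f_right):
    """Fuse the left and right 2D features sampled for one voxel.

    Assumed rule: the soft weight is the cosine similarity clamped to
    [0, 1], applied to the mean of the two feature vectors.
    """
    weight = max(0.0, cosine_similarity(f_left, f_right))
    return [weight * 0.5 * (l + r) for l, r in zip(f_left, f_right)]
```

With this rule, identical left and right features pass through unchanged, while orthogonal features are suppressed to zero, which mimics a soft assignment driven by stereo consistency.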