HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving
Zhiwen Yang, Yuxin Peng
TL;DR
Camera-based semantic scene completion (SSC) must infer dense 3D occupancy from 2D images, and it suffers from two gaps in doing so: an input-output dimension gap and an annotation-reality density gap. HD$^2$-SSC addresses the dimension gap with High-dimension Semantic Decoupling (HSD), which expands and decouples coarse pixel semantics into high-dimensional voxelized representations, followed by Semantic Aggregation, which clusters and differentiates those semantics via cross-attention and a decoupling loss. It addresses the density gap with High-density Occupancy Refinement (HOR), a detect-and-refine pipeline that aligns geometric and semantic voxel distributions to fill missing voxels and correct erroneous ones, improving semantic density. On SemanticKITTI and SSCBench-KITTI-360, HD$^2$-SSC achieves state-of-the-art IoU and mIoU, supporting the effectiveness of decoupled semantic expansion and distribution-aligned refinement for autonomous-driving perception.
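For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how IoU and mIoU are commonly computed in SSC: IoU scores binary occupancy (occupied versus empty), while mIoU averages per-class IoU over the semantic classes. The function name `ssc_metrics`, the label conventions (`0` for empty, `255` for unlabeled voxels), and the choice to skip classes absent from both prediction and ground truth are assumptions for this toy example, not details taken from the paper; benchmark evaluation scripts typically accumulate counts over the whole dataset instead.

```python
import numpy as np

def ssc_metrics(pred, gt, num_classes, empty_label=0, ignore_label=255):
    """Toy per-scene IoU / mIoU for integer voxel label arrays of equal shape.

    Label conventions here are assumptions for illustration only.
    """
    pred = np.asarray(pred)
    gt = np.asarray(gt)

    # Drop voxels with no ground-truth annotation.
    valid = gt != ignore_label
    pred, gt = pred[valid], gt[valid]

    # Scene-completion IoU: class-agnostic, occupied vs. empty.
    pred_occ = pred != empty_label
    gt_occ = gt != empty_label
    union = np.logical_or(pred_occ, gt_occ).sum()
    inter = np.logical_and(pred_occ, gt_occ).sum()
    iou = float(inter / union) if union > 0 else float("nan")

    # Semantic mIoU: mean of per-class IoU over non-empty classes.
    per_class = []
    for c in range(num_classes):
        if c == empty_label:
            continue
        p, g = pred == c, gt == c
        u = np.logical_or(p, g).sum()
        if u == 0:
            continue  # class absent from both; skipped in this toy version
        per_class.append(np.logical_and(p, g).sum() / u)
    miou = float(np.mean(per_class)) if per_class else float("nan")
    return iou, miou

# Six voxels, three labels (0 = empty); the fifth voxel is unlabeled.
gt = [0, 1, 1, 2, 255, 0]
pred = [0, 1, 2, 2, 1, 1]
iou, miou = ssc_metrics(pred, gt, num_classes=3)
```

In this example a voxel predicted as occupied with the wrong class still counts toward IoU but hurts mIoU, which is why the two numbers are reported separately.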
Abstract
Camera-based 3D semantic scene completion (SSC) plays a crucial role in autonomous driving, enabling voxelized 3D scene understanding for effective scene perception and decision-making. Existing SSC methods have improved 3D scene representations, but they suffer from an inherent input-output dimension gap and an annotation-reality density gap: the input is a 2D planar view from images, supervised with sparsely annotated labels, whereas the target is the dense occupancy of the real world in a 3D stereoscopic view, and this mismatch leads to inferior predictions. In light of this, we propose the High-Dimension High-Density Semantic Scene Completion (HD$^2$-SSC) framework, which expands pixel semantics and refines voxel occupancies. To bridge the dimension gap, a High-dimension Semantic Decoupling module expands 2D image features along a pseudo third dimension, decoupling coarse pixel semantics from occlusions, and then identifies focal regions with fine semantics to enrich the image features. To mitigate the density gap, a High-density Occupancy Refinement module adopts a "detect-and-refine" architecture that leverages contextual geometric and semantic structures to enhance semantic density by completing missing voxels and correcting erroneous ones. Extensive experiments and analyses on the SemanticKITTI and SSCBench-KITTI-360 datasets validate the effectiveness of our HD$^2$-SSC framework.
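To make "expanding 2D image features along a pseudo third dimension" concrete, here is a minimal sketch of one generic way to do it: each pixel's feature vector is distributed over D pseudo-depth slots using softmax weights. The abstract does not specify how the module performs the expansion, so this weighting scheme, the function name `expand_along_pseudo_depth`, and the `depth_logits` input are all assumptions for illustration and should not be read as the High-dimension Semantic Decoupling module itself.

```python
import numpy as np

def expand_along_pseudo_depth(feat, depth_logits):
    """Lift a 2D feature map to a pseudo-3D volume (illustrative sketch).

    feat:         (C, H, W) image features.
    depth_logits: (D, H, W) unnormalized scores over D pseudo-depth slots
                  (hypothetical input; in practice a network would predict it).
    Returns a (C, D, H, W) array.
    """
    # Numerically stable softmax over the pseudo-depth axis.
    z = depth_logits - depth_logits.max(axis=0, keepdims=True)
    weights = np.exp(z)
    weights /= weights.sum(axis=0, keepdims=True)

    # Broadcasted outer product: every pixel feature is spread over D slots.
    return feat[:, None, :, :] * weights[None, :, :, :]

# Toy inputs: 2 channels, 5 pseudo-depth slots, a 3x4 image.
feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
depth_logits = np.random.default_rng(0).normal(size=(5, 3, 4))
volume = expand_along_pseudo_depth(feat, depth_logits)
```

Because the weights sum to one at each pixel, summing `volume` over the pseudo-depth axis recovers the original 2D features; the expansion only redistributes each pixel's semantics across slots, which is the property that lets overlapping (for example occluded) content be separated downstream.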
