
Camera-based 3D Semantic Scene Completion with Sparse Guidance Network

Jianbiao Mei, Yu Yang, Mengmeng Wang, Junyu Zhu, Jongwon Ra, Yukai Ma, Laijian Li, Yong Liu

TL;DR

This work tackles camera-based Semantic Scene Completion (SSC) by introducing SGN, a one-stage dense-sparse-dense framework that propagates semantic information from seed voxels to the entire 3D scene using depth-informed seed selection and geometry-aware guidance. The method integrates a depth-based sparse voxel proposal network (SVPN), geometry guidance via an auxiliary occupancy head, and a sparse semantic guidance pathway that enhances inter-class feature separation, followed by a voxel aggregation step and a multi-scale semantic propagation module. SGN is trained end-to-end with a combination of losses including geometry, occupancy, semantic, and scene-class affinity terms, achieving state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360 while maintaining a lightweight footprint (e.g., SGN-L with 12.5M parameters). The results demonstrate strong short-range accuracy, improved segmentation boundaries, and favorable efficiency, highlighting SGN's potential for real-time, resource-constrained autonomous driving systems. The work also reports results on an indoor dataset (NYUv2), suggesting that the dense-sparse-dense paradigm with hybrid guidance generalizes beyond driving scenes.

Abstract

Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. First, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to directly process points generated by depth prediction in a coarse-to-fine manner. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing computational cost. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. Even our lightweight version, SGN-L, achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.
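The geometric idea behind depth-based seed selection can be illustrated with a minimal sketch: lift each pixel to a 3D point using its predicted depth, then mark the voxels that contain at least one point as seed candidates. This is not the authors' sparse voxel proposal network, which is learned and coarse-to-fine; the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the grid parameters below are assumptions chosen for illustration, and the camera-to-grid extrinsic transform is omitted.

```python
import math
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def backproject(depth: Sequence[Sequence[float]],
                fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift every pixel (u, v) with a valid depth d to a camera-frame point."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0.0:  # treat non-positive depth as invalid
                continue
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


def seed_voxels(points: Sequence[Point], origin: Point,
                voxel_size: float, grid_shape: Voxel) -> Set[Voxel]:
    """Return the indices of voxels that contain at least one point."""
    seeds: Set[Voxel] = set()
    for p in points:
        idx = tuple(math.floor((p[i] - origin[i]) / voxel_size) for i in range(3))
        # Discard points that fall outside the (hypothetical) scene grid.
        if all(0 <= idx[i] < grid_shape[i] for i in range(3)):
            seeds.add(idx)
    return seeds


# Toy 2x2 depth map; the zero entry stands for a missing prediction.
toy_depth = [[1.0, 2.0],
             [0.0, 4.0]]
toy_points = backproject(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
toy_seeds = seed_voxels(toy_points, origin=(-2.0, -2.0, 0.0),
                        voxel_size=1.0, grid_shape=(4, 4, 4))
```

In this toy case three pixels have valid depth, and one of the resulting points lies outside the grid, so only two voxels become seed candidates. A learned proposal network would refine such candidates rather than take them at face value.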

Paper Structure (37 sections, 13 equations, 8 figures, 11 tables)


Figures (8)

  • Figure 1: (a) Fully dense processing with heavy and complex 3D model. (b) MAE-like architecture in a "sparse-to-dense" manner. (c) Our "dense-sparse-dense" design with hybrid guidance and semantic propagation. "A" denotes the voxel aggregation layer for geometry cues.
  • Figure 2: Overall framework of our SGN. The image encoder extracts 2D features, establishing the foundation for the 3D features generated through view transformation. An auxiliary occupancy head is applied to provide geometry guidance. The sparse semantic guidance consists of two parts: sparse voxel proposal and semantic guidance. The depth-based occupancy prediction is designed for the sparse voxel proposal. This proposal, along with the 3D features, is fed into the subsequent semantic guidance (depicted in Figure 3) to index seed features and inject semantic context into these seed features. Afterward, the voxel aggregation layer combines the semantic-aware seed features, geometry prior from the non-seed features, and occupancy-aware features from the depth-based occupancy prediction. This forms the informative voxel features processed by the multi-scale semantic propagation for the final prediction.
  • Figure 3: Detailed architecture of the proposed semantic guidance module (SGM). The sparse encoder block (SEB) consists of a sparse feature encoder and a sparse geometry feature encoder adopted from [ye2022efficient].
  • Figure 4: Visual comparison of our SGN-T with state-of-the-art methods on SemanticKITTI validation. Compared to VoxFormer-T and MonoScene, our SGN-T generates more precise segmentation boundaries (labeled in red circles).
  • Figure 5: Effect of temporal frames. Frames are sampled at an interval of three frames. Memory denotes training memory.
  • ...and 3 more figures
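
The voxel aggregation step described in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions, not the authors' layer: it assumes seed voxels take their semantic-aware feature, non-seed voxels keep their original 3D feature as a geometry prior, and the occupancy-aware feature is fused by concatenation. The function name `aggregate_voxels` and the dictionary-based layout are hypothetical.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]
Feature = List[float]


def aggregate_voxels(dense_feats: Dict[Voxel, Feature],
                     seed_feats: Dict[Voxel, Feature],
                     occ_feats: Dict[Voxel, Feature]) -> Dict[Voxel, Feature]:
    """Fuse per-voxel features into one dense feature map.

    Seed voxels use their semantic-aware feature, all other voxels keep
    their original feature; the occupancy-aware feature is then appended
    (concatenation is an assumption made for this sketch).
    """
    fused: Dict[Voxel, Feature] = {}
    for voxel, feat in dense_feats.items():
        base = seed_feats.get(voxel, feat)
        fused[voxel] = base + occ_feats[voxel]  # list concatenation
    return fused


# Toy scene with two voxels, one of which is a seed.
dense = {(0, 0, 0): [1.0, 1.0], (0, 0, 1): [2.0, 2.0]}
seeds = {(0, 0, 1): [9.0, 8.0]}
occupancy = {(0, 0, 0): [0.1], (0, 0, 1): [0.9]}
fused = aggregate_voxels(dense, seeds, occupancy)
```

The fused map stays dense, so a subsequent propagation module can spread the seed semantics to neighbouring non-seed voxels.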