SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion

Zhiwen Yang, Yuxin Peng

TL;DR

This paper proposes the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations to jointly exploit semantic and physical information and generates SSC results with realistic details.

Abstract

Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.

Paper Structure

This paper contains 16 sections, 17 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The overall architecture of our SPHERE approach. The Semantic-guided Gaussian Initialization (SGI) module leverages a dual-branch encoder to exploit local and global semantics, and then selects focal voxels as anchors for effective and efficient Gaussian initialization. The Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details, then enhances semantic-geometry consistency via focal distribution alignment between the voxel and Gaussian semantics.
  • Figure 2: Illustration of the Semantic-guided Gaussian Initialization (SGI) process. Voxel-wise feature similarities are first computed between the voxel and TPV features to select the top-$k$ focal voxels as anchors. Then, a Gaussian projection layer generates Gaussian properties from the semantics and positions of the focal anchors.
  • Figure 3: Illustration of the Semantic Spherical Harmonics. The initial Gaussian semantics are passed through an SH projection to match the SH coefficients, exploiting physical-aware contextual details. Because spherical harmonics form a complete set of orthogonal functions, an orthogonal loss is also applied to the projection matrix.
  • Figure 4: Effect of the number of Gaussians on the SSC performance for the SemanticKITTI validation set.
  • Figure 5: Qualitative visualization results on the SemanticKITTI validation set. Cyan boxes outline the occupancy ground truth. Red boxes indicate false occupancy predictions of the strongest competing method, SGN, and the baseline method, VoxFormer, while green boxes indicate improved scene completion results generated by our SPHERE approach. Best viewed zoomed in.
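
The captions of Figures 2 and 3 describe two steps that a small example can make concrete: selecting top-$k$ focal voxels from voxel/TPV feature similarities, and penalizing a non-orthogonal projection matrix. The sketch below is illustrative only and is not taken from the SPHERE code. It assumes cosine similarity as the similarity measure, assumes that the voxels with the highest similarity are the ones selected, assumes the TPV features have already been sampled at each voxel, and assumes the orthogonal loss takes the common form $\lVert W^\top W - I \rVert_F^2$. The function names `select_focal_voxels` and `orthogonal_penalty` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_focal_voxels(voxel_feats, tpv_feats, k):
    """Return the indices of the k voxels with the highest voxel/TPV similarity.

    Assumption: voxel_feats[i] and tpv_feats[i] describe the same voxel i,
    and "top-k" means the k largest cosine similarities.
    """
    sims = [cosine(v, t) for v, t in zip(voxel_feats, tpv_feats)]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order[:k]


def orthogonal_penalty(W):
    """Squared Frobenius norm of (W^T W - I) for a matrix given as a list of rows.

    Assumption: this is one common way to write an orthogonality loss;
    the paper's exact formulation may differ.
    """
    cols = list(zip(*W))
    total = 0.0
    for i, col_i in enumerate(cols):
        for j, col_j in enumerate(cols):
            gram = sum(a * b for a, b in zip(col_i, col_j))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total


# Toy usage: three voxels with 2-D features.
voxel_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tpv_feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
anchors = select_focal_voxels(voxel_feats, tpv_feats, k=2)

# The identity matrix is orthogonal, so its penalty is zero.
penalty = orthogonal_penalty([[1.0, 0.0], [0.0, 1.0]])
```

In a real pipeline the selected indices would feed the Gaussian projection layer mentioned in Figure 2, and the penalty would be added to the training loss; both would normally be written with tensor operations rather than Python lists.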