Unleashing Semantic and Geometric Priors for 3D Scene Completion
Shiyuan Chen, Wei Sui, Bohao Zhang, Zeyd Boukhers, John See, Cong Yang
TL;DR
FoundationSSC tackles the semantic-geometry conflict in camera-based SSC through dual decoupling: a Foundation Encoder provides separate semantic priors and high-fidelity stereo cost volumes, and decoupled semantic and geometric pathways refine these priors. A Geometry-Aware Context Adapter and a Disparity-to-Depth Volume Mapping preserve geometric consistency and probabilistic depth cues, which a Hybrid View Transformation then lifts into 3D space. Axis-Aware Fusion anisotropically fuses the resulting 3D feature volumes into a unified representation for final prediction. The approach achieves state-of-the-art results on SemanticKITTI (+0.23 mIoU and +2.03 IoU over the previous best) and on SSCBench-KITTI-360 (21.78 mIoU and 48.61 IoU), and demonstrates robustness across long-tail classes and challenging scenes. The framework offers a scalable way to leverage vision foundation models for precise 3D scene understanding in autonomous driving and robotics, with potential for real-world deployment and future temporal extensions.
Abstract
Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to trade off between conflicting demands and limits its overall performance. To resolve this conflict, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. We also introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. Extensive experiments demonstrate the advantages of FoundationSSC, which improves semantic and geometric metrics simultaneously, surpassing the previous best results by +0.23 mIoU and +2.03 IoU on SemanticKITTI. On SSCBench-KITTI-360, we also achieve state-of-the-art performance, with 21.78 mIoU and 48.61 IoU.
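
To make the idea of turning stereo cost volumes into depth distributions concrete, the following is a minimal sketch and not the paper's actual Disparity-to-Depth Volume Mapping, whose details are not given in this section. It assumes the standard rectified-stereo relation z = f * B / d and redistributes a per-pixel probability distribution over disparity candidates onto uniformly spaced depth bins by linear interpolation. The function name, parameters, and the camera values in the usage note are hypothetical.

```python
def disparity_to_depth_distribution(disp_probs, disparities, focal_px, baseline_m, depth_bins):
    """Redistribute probability mass from disparity candidates onto depth bins.

    Assumptions (not from the paper): depth_bins are uniformly spaced bin
    centres in metres, and each disparity d > 0 maps to depth z = f * B / d.
    """
    out = [0.0] * len(depth_bins)
    step = depth_bins[1] - depth_bins[0]
    last = len(depth_bins) - 1
    for p, d in zip(disp_probs, disparities):
        if d <= 0:
            # Zero disparity corresponds to infinite depth; its mass is dropped here.
            continue
        z = focal_px * baseline_m / d
        # Continuous bin position, clamped so out-of-range depths land on the edge bins.
        pos = min(max((z - depth_bins[0]) / step, 0.0), float(last))
        lo = int(pos)
        hi = min(lo + 1, last)
        w_hi = pos - lo
        # Split the mass linearly between the two nearest bin centres.
        out[lo] += p * (1.0 - w_hi)
        out[hi] += p * w_hi
    return out
```

For example, with a hypothetical focal length of 700 px and baseline of 0.5 m, a disparity of 28 px corresponds to a depth of 12.5 m, so its probability mass is split equally between bins centred at 10 m and 15 m. Because the mapping from disparity to depth is non-linear, uniformly spaced disparity candidates fall densely on near depth bins and sparsely on far ones, which is why a soft assignment of this kind is a common choice for keeping the distribution smooth.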
