Unleashing Semantic and Geometric Priors for 3D Scene Completion

Shiyuan Chen, Wei Sui, Bohao Zhang, Zeyd Boukhers, John See, Cong Yang

TL;DR

FoundationSSC tackles the semantic-geometry conflict in camera-based SSC by introducing dual decoupling: a Foundation Encoder provides separate semantic priors and high-fidelity stereo costs, while decoupled semantic and geometric pathways refine these priors. A Geometry-Aware Context Adapter and a Disparity-to-Depth Volume Mapping preserve geometric consistency and probabilistic depth cues, which are then lifted via a Hybrid View Transformation into 3D space. Axis-Aware Fusion anisotropically fuses the resulting 3D feature volumes to form a unified representation for final prediction. The approach achieves state-of-the-art results on SemanticKITTI and SSCBench-KITTI-360, with clear gains in both mIoU and IoU metrics, and demonstrates robustness across long-tail classes and challenging scenes. This framework provides a scalable pathway to leverage vision foundation models for precise 3D scene understanding in autonomous driving and robotics, with strong potential for real-world deployment and future temporal extensions.

Abstract

Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to trade off between conflicting demands and limits its overall performance. To address this limitation, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. Extensive experiments demonstrate the advantages of FoundationSSC, which improves semantic and geometric metrics simultaneously, surpassing the prior best results by +0.23 mIoU and +2.03 IoU on SemanticKITTI. We also achieve state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61 IoU.
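
The abstract reports two metrics. In SSC benchmarks, IoU conventionally scores geometric completion (occupied versus empty voxels, ignoring class), while mIoU averages the per-class IoU over the semantic classes. The sketch below illustrates that distinction on flat label lists. The label ids (`0` for empty, `255` for ignored voxels), the function names, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, not details taken from the paper or from the official evaluation scripts.

```python
from typing import List, Sequence

EMPTY = 0     # assumed label id for free space
IGNORE = 255  # assumed label id for voxels excluded from evaluation


def completion_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Geometric IoU: treat every non-empty label as 'occupied'."""
    inter = union = 0
    for p, t in zip(pred, target):
        if t == IGNORE:
            continue
        p_occ, t_occ = p != EMPTY, t != EMPTY
        inter += int(p_occ and t_occ)
        union += int(p_occ or t_occ)
    return inter / union if union else 0.0


def semantic_miou(pred: Sequence[int], target: Sequence[int],
                  num_classes: int) -> float:
    """Mean of per-class IoU over semantic classes 1..num_classes-1."""
    ious: List[float] = []
    for c in range(1, num_classes):
        inter = union = 0
        for p, t in zip(pred, target):
            if t == IGNORE:
                continue
            inter += int(p == c and t == c)
            union += int(p == c or t == c)
        if union:  # assumption: skip classes absent from both volumes
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels, two semantic classes plus empty.
pred = [0, 1, 1, 2, 0, 2]
target = [0, 1, 2, 2, 1, 255]
print(completion_iou(pred, target))    # occupancy only
print(semantic_miou(pred, target, 3))  # class-aware
```

A prediction can score well on IoU while scoring poorly on mIoU (correct occupancy, wrong labels), which is why a simultaneous gain in both is reported separately.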

Paper Structure

This paper contains 30 sections, 6 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The conventional SSC framework (top) depends on a coupled image encoder, providing limited feature priors for both the semantic and geometric branches, resulting in an inherent trade-off. In contrast, our FoundationSSC (bottom) utilises a foundation encoder that provides robust decoupled semantic and geometric priors, effectively addressing conflicts at both the source and pathway levels.
  • Figure 2: Overview of our proposed FoundationSSC framework. It begins with a Foundation Encoder producing decoupled priors, which are then enhanced in Decoupled Semantic Geometric Pathways. A Hybrid View Transformation subsequently lifts and fuses these priors into a unified 3D volume, which is processed by a Decoding Head to yield the final prediction.
  • Figure 3: Illustration of the DDVM module, which transforms disparity cost volume to depth distribution.
  • Figure 4: Illustration of the fusion unit for a single axis (e.g., the $Z$-axis) within the proposed AAF module.
  • Figure 5: Qualitative visualisation results on SemanticKITTI validation set. Our FoundationSSC predicts objects with sharper geometric boundaries and more accurate semantic labels, outperforming competing works.
  • ...and 2 more figures
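
The Figure 3 caption describes DDVM as transforming a disparity cost volume into a depth distribution. The paper's actual module is not shown in this excerpt, so the following is only a minimal sketch of the underlying idea for a single pixel, based on the standard rectified-stereo relation depth = focal length × baseline / disparity. The function name, the integer-disparity bin layout, the linear interpolation, and the final renormalisation (which omits the change-of-variables Jacobian) are all assumptions made for illustration.

```python
from typing import List, Sequence


def disparity_to_depth_distribution(
    disp_probs: Sequence[float],  # disp_probs[i]: probability of disparity i (pixels)
    focal_px: float,              # focal length in pixels (hypothetical parameter)
    baseline_m: float,            # stereo baseline in metres (hypothetical parameter)
    depth_bins: Sequence[float],  # depth bin centres in metres, all > 0
) -> List[float]:
    """Resample a per-pixel disparity distribution onto depth bins."""
    max_d = len(disp_probs) - 1
    sampled: List[float] = []
    for z in depth_bins:
        d = focal_px * baseline_m / z  # disparity that corresponds to depth z
        if d < 0 or d > max_d:
            sampled.append(0.0)        # outside the disparity search range
            continue
        lo = int(d)
        hi = min(lo + 1, max_d)
        w = d - lo
        # Linear interpolation between the two neighbouring disparity bins.
        sampled.append((1.0 - w) * disp_probs[lo] + w * disp_probs[hi])
    total = sum(sampled)
    if total == 0.0:
        # No probability mass landed in range: fall back to a uniform distribution.
        return [1.0 / len(depth_bins)] * len(depth_bins)
    return [s / total for s in sampled]


# Toy example: all probability mass on disparity 5 px.
probs = [0.0] * 11
probs[5] = 1.0
depth = disparity_to_depth_distribution(probs, focal_px=100.0, baseline_m=0.5,
                                        depth_bins=[5.0, 10.0, 25.0, 50.0])
print(depth)  # mass should concentrate at 10 m, since 100 * 0.5 / 5 = 10
```

Because depth is inversely proportional to disparity, uniformly spaced disparity bins map to non-uniformly spaced depths, which is why some resampling step is needed before the distribution can be used to lift image features along fixed depth bins.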