Towards 3D Object-Centric Feature Learning for Semantic Scene Completion
Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang
TL;DR
This work addresses the limitations of ego-centric 3D feature fusion in semantic scene completion (SSC) from a monocular image. It introduces Ocean, an object-centric SSC framework that leverages MobileSAM priors through two components. The first is SemGroup Dual Attention (SGDA), which consists of 3D Semantic Group Attention (SGA3D) and Global Similarity-Guided Attention (GSGA). The second is an Instance-aware Local Diffusion (ILD) module that diffuses instance-level information into the BEV and 3D representations. The method achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI360, and ablations support the contributions of object-centric aggregation, depth-aware 3D extension, and diffusion-based refinement. Overall, Ocean demonstrates the benefits of explicit object-centric reasoning for vision-based 3D SSC, with potential impact on autonomous driving perception pipelines.
Abstract
Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.
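To make the idea of aggregating object-centric features with linear attention more concrete, the following is a minimal NumPy sketch. It is not the paper's 3D Semantic Group Attention module. It assumes that each token (for example, a voxel feature) already carries an integer instance label, and it restricts kernelized linear attention to tokens sharing that label. The function names `elu_plus_one` and `grouped_linear_attention`, the feature map elu(x) + 1, and the tensor shapes are all illustrative assumptions.

```python
import numpy as np


def elu_plus_one(x):
    # Positive feature map phi(x) = elu(x) + 1 (an assumed, commonly used choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def grouped_linear_attention(q, k, v, group_ids, eps=1e-6):
    """Linear attention restricted to tokens with the same group label.

    q, k: (N, d) queries and keys; v: (N, d_v) values;
    group_ids: (N,) integer instance label per token (hypothetical input).
    """
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    out = np.zeros((q.shape[0], v.shape[1]))
    for g in np.unique(group_ids):
        idx = group_ids == g
        # Summaries computed once per group, so cost is linear in group size.
        kv = phi_k[idx].T @ v[idx]      # (d, d_v)
        z = phi_k[idx].sum(axis=0)      # (d,)
        num = phi_q[idx] @ kv           # (n_g, d_v)
        den = phi_q[idx] @ z + eps      # (n_g,)
        out[idx] = num / den[:, None]
    return out


# Example: six tokens split into two hypothetical instances.
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 2))
labels = np.array([0, 0, 0, 1, 1, 1])
features = grouped_linear_attention(q, k, v, labels)
```

Because the feature map is strictly positive, each output row is a weighted average of the values within its own group, so information does not leak across instance boundaries in this sketch. Handling tokens with missing or wrong labels, which the abstract assigns to the Global Similarity-Guided Attention module, is outside its scope.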
