Towards 3D Object-Centric Feature Learning for Semantic Scene Completion

Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang

TL;DR

This work tackles semantic scene completion from a monocular image by addressing the limitations of ego-centric 3D feature fusion. It introduces Ocean, an object-centric SSC framework that leverages MobileSAM priors through SemGroup Dual Attention (SGDA), consisting of 3D Semantic Group Attention (SGA3D) and Global Similarity-Guided Attention (GSGA), plus an Instance-aware Local Diffusion (ILD) module to diffuse instance-level information into the BEV and 3D representations. The method demonstrates state-of-the-art performance on SemanticKITTI and SSCBench-KITTI360, with clear ablations validating the contributions of object-centric aggregation, depth-aware 3D extension, and diffusion-based refinement. Overall, Ocean showcases the viability and benefits of explicit object-centric reasoning for vision-based 3D SSC, with potential impact on autonomous driving perception pipelines.

Abstract

Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.
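
The abstract says the 3D Semantic Group Attention module relies on linear attention but does not give the exact formulation here. As a rough illustration only, the sketch below uses one common kernelized variant with the feature map phi(x) = elu(x) + 1. This choice, and the names `feature_map` and `linear_attention`, are assumptions for illustration and are not taken from the paper. The point it shows is that keys and values are summarized once, so the cost grows linearly with the number of tokens in a group rather than quadratically.

```python
import math


def feature_map(x):
    """phi(x) = elu(x) + 1, elementwise; every output is strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelized (linear) attention over one group of tokens.

    queries: N vectors of dimension d
    keys:    M vectors of dimension d
    values:  M vectors of dimension e
    Returns N vectors of dimension e.
    Assumption: this is a generic formulation, not the paper's exact module.
    """
    d = len(keys[0])
    e = len(values[0])
    kv = [[0.0] * e for _ in range(d)]  # running sum of phi(k_j) v_j^T
    k_sum = [0.0] * d                   # running sum of phi(k_j)
    for k, v in zip(keys, values):
        pk = feature_map(k)
        for a in range(d):
            k_sum[a] += pk[a]
            for b in range(e):
                kv[a][b] += pk[a] * v[b]

    out = []
    for q in queries:
        pq = feature_map(q)
        # Normalizer phi(q)^T sum_j phi(k_j); positive because phi > 0.
        norm = sum(pq[a] * k_sum[a] for a in range(d))
        out.append(
            [sum(pq[a] * kv[a][b] for a in range(d)) / norm for b in range(e)]
        )
    return out
```

Because the weights are positive and normalized, each output is a weighted average of the value vectors in the group. That matches the intuition of aggregating features that belong to the same object.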

Paper Structure

This paper contains 30 sections, 8 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison between our object-centric learning guided by MobileSAM and previous scene-level paradigms.
  • Figure 2: Overview of the proposed Ocean architecture. Given the monocular image as input, we first extract visual features using an image encoder and lift them into 3D space following LSS. To enable object-centric feature learning, we segment the scene using MobileSAM and design the SGDA block to aggregate features through both local and global attention. Furthermore, we propose the ILD module to refine the overall scene representation by incorporating instance-level features.
  • Figure 3: Details of the SemGroup Dual Attention (SGDA) block.
  • Figure 4: The Semantic Grouping. 3D query proposals are projected onto the image plane, assigned instance IDs via nearest-neighbor sampling, and clustered with image pixels of the same instance for aggregation using linear attention.
  • Figure 5: The details of the Dynamic Instance Decoder. Given the instance features, we reconstruct them into the scene-level BEV representation using a transposed convolutional decoder. Furthermore, we employ Gumbel Softmax to enable dynamic instance selection.
  • ...and 3 more figures
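
The grouping step described in the Figure 4 caption (project 3D query proposals onto the image, read an instance ID by nearest-neighbor sampling, then group by ID) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a plain pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, takes points already in the camera frame, and represents the instance mask as a nested list of integer IDs. The function names are illustrative and are not taken from the paper.

```python
import math
from collections import defaultdict


def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        return None  # point lies behind the camera
    return fx * x / z + cx, fy * y / z + cy


def group_queries_by_instance(points, instance_mask, fx, fy, cx, cy):
    """Group 3D query indices by the instance ID found at their nearest pixel.

    points:        list of (x, y, z) tuples in the camera frame (assumed)
    instance_mask: H x W nested list of integer instance IDs (assumed layout)
    Returns {instance_id: [query indices]}.
    """
    height, width = len(instance_mask), len(instance_mask[0])
    groups = defaultdict(list)
    for idx, point in enumerate(points):
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None:
            continue
        # Nearest-neighbor sampling: round each coordinate to the closest pixel.
        col = math.floor(uv[0] + 0.5)
        row = math.floor(uv[1] + 0.5)
        if 0 <= row < height and 0 <= col < width:
            groups[instance_mask[row][col]].append(idx)
    return dict(groups)
```

Queries that project outside the image or behind the camera receive no instance ID in this sketch. According to the abstract, the Global Similarity-Guided Attention module is meant to handle segmentation errors and missing instances.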