Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations

Meng Wang; Fan Wu; Yunchuan Qin; Ruihui Li; Zhuo Tang; Kenli Li

Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations

Meng Wang, Fan Wu, Yunchuan Qin, Ruihui Li, Zhuo Tang, Kenli Li

TL;DR

This paper tackles robust 3D semantic scene completion in driving scenarios with dynamic objects. It introduces CDScene, which uses multimodal 2D-to-3D alignment to fuse explicit 2D semantics into 3D space and decouples scene information into dynamic and static features via monocular and stereo depth cues. A dynamic-static adaptive fusion module then combines these features to achieve robust semantic completion. Experimental results on SemanticKITTI, SSCBench-KITTI360, and SemanticKITTI-C demonstrate superior accuracy and robustness compared with state-of-the-art methods, highlighting potential benefits for autonomous driving deployment.

Abstract

The vision-based semantic scene completion task aims to predict dense geometric and semantic 3D scene representations from 2D images. However, the presence of dynamic objects in the scene seriously affects the accuracy of the model inferring 3D structures from 2D images. Existing methods simply stack multiple frames of image input to increase dense scene semantic information, but ignore the fact that dynamic objects and non-texture areas violate multi-view consistency and matching reliability. To address these issues, we propose a novel method, CDScene: Vision-based Robust Semantic Scene Completion via Capturing Dynamic Representations. First, we leverage a multimodal large-scale model to extract 2D explicit semantics and align them into 3D space. Second, we exploit the characteristics of monocular and stereo depth to decouple scene information into dynamic and static features. The dynamic features contain structural relationships around dynamic objects, and the static features contain dense contextual spatial information. Finally, we design a dynamic-static adaptive fusion module to effectively extract and aggregate complementary features, achieving robust and accurate semantic scene completion in autonomous driving scenarios. Extensive experimental results on the SemanticKITTI, SSCBench-KITTI360, and SemanticKITTI-C datasets demonstrate the superiority and robustness of CDScene over existing state-of-the-art methods.

Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations

TL;DR

Abstract

Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations

TL;DR

Abstract

Paper Structure

Table of Contents