VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion

Meng Wang, Huilong Pi, Ruihui Li, Yunchuan Qin, Zhuo Tang, Kenli Li

TL;DR

VLScene tackles camera-based 3D semantic scene completion by injecting high-level semantic priors from vision-language models through vision-language guidance distillation, enriching 2D features to improve 3D spatial reasoning. It introduces a geometric-semantic sparse awareness module, consisting of Neighborhood Geometry Propagation and Sparse Semantic Interaction, to propagate and refine voxel features in sparse 3D space. The approach combines a lightweight image backbone (RepViT), VL features from LSeg, and a 2D-to-3D projection (LSS) with a distillation-augmented semantic pipeline, optimized via SSC losses and KD losses. Empirically, VLScene achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360 with a competitive parameter count (47.4M) and efficient inference, which makes it practical for autonomous driving perception.
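
The summary above says training combines SSC losses with KD losses but does not give the KD formulation. As a rough illustration only, the sketch below uses a generic temperature-softened KL distillation term added to a task loss. This is a common choice, not necessarily the one used in VLScene. The function names, the temperature, and `kd_weight` are all hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def kd_kl_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KL(teacher || student) on softened distributions.

    Assumption: this is a standard distillation term, not the
    loss defined in the VLScene paper.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (e.g. VL features)
    log_q = log_softmax(student_logits, temperature)  # student (image branch)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


def total_loss(ssc_loss, kd_loss, kd_weight=1.0):
    """Hypothetical combined objective: task loss plus weighted KD loss."""
    return ssc_loss + kd_weight * kd_loss


if __name__ == "__main__":
    student = [1.0, 0.5, -0.2]
    teacher = [2.0, 0.1, -1.0]
    kd = kd_kl_loss(student, teacher)
    print(total_loss(ssc_loss=1.3, kd_loss=kd, kd_weight=0.5))
```

The KD term is zero when student and teacher logits agree and grows as the student distribution drifts from the teacher's, which is the behavior a guidance-distillation signal needs.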

Abstract

Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving. However, images provide limited information, making the model susceptible to geometric ambiguity caused by occlusion and perspective distortion. Existing methods often lack explicit semantic modeling between objects, limiting their perception of 3D semantic context. To address these challenges, we propose a novel method, VLScene: Vision-Language Guidance Distillation for Camera-based 3D Semantic Scene Completion. The key insight is to use a vision-language model to introduce high-level semantic priors that provide the object spatial context required for 3D scene understanding. Specifically, we design a vision-language guidance distillation process to enhance image features, which effectively captures semantic knowledge from the surrounding environment and improves spatial context reasoning. In addition, we introduce a geometric-semantic sparse awareness mechanism to propagate geometric structures in the neighborhood and enhance semantic information through contextual sparse interactions. Experimental results demonstrate that VLScene ranks first on two challenging benchmarks, SemanticKITTI and SSCBench-KITTI-360, with mIoU scores of 17.52 and 19.10, respectively.
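
For readers unfamiliar with the reported metric, the following minimal sketch shows how mean intersection-over-union (mIoU) can be computed over flattened voxel labels. It is an illustration under stated assumptions, not the benchmarks' official evaluation code. The `ignore_label` value of 255 and the function name are hypothetical, and predictions are assumed to lie in `[0, num_classes)`.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean IoU over classes that appear in the prediction or the target.

    pred, target: flat sequences of integer class labels, one per voxel.
    Voxels whose target equals ignore_label are skipped (assumed convention).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 1, 1, 2]
    target = [0, 1, 2, 2]
    # Per-class IoU: class 0 = 1/1, class 1 = 1/2, class 2 = 1/2.
    print(mean_iou(pred, target, num_classes=3))
```

Benchmark scores such as 17.52 are presumably this quantity expressed as a percentage; which classes enter the average (for example, whether the empty class is excluded) depends on the benchmark's protocol.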

Paper Structure

This paper contains 36 sections, 11 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison of our method, which uses vision-language model guidance distillation, with the previous image-based method.
  • Figure 2: (a) VLScene uses 2D high-level semantic priors to improve 3D scene understanding. (b) Comparison of Params and mIoU.
  • Figure 3: The VLScene framework is proposed for camera-based 3D semantic scene completion.
  • Figure 4: Illustration of the enhanced semantic feature distillation.
  • Figure 5: Illustration of the geometric-semantic sparse awareness.
  • ...and 3 more figures