L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model

Ruoyu Wang, Yukai Ma, Yi Yao, Sheng Tao, Haoang Li, Zongzhi Zhu, Yong Liu, Xingxing Zuo

TL;DR

L2COcc is a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. It reduces both memory consumption and inference time by over 23% compared to the current state-of-the-art method.

Abstract

Semantic Scene Completion (SSC) is a pivotal element of autonomous driving perception systems, tasked with inferring the 3D semantic occupancy of a scene from sensory data. To improve accuracy, prior research has relied on various computationally demanding and memory-intensive 3D operations, which impose significant computational requirements on the platform during training and testing. This paper proposes L2COcc, a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. With our proposed efficient voxel transformer (EVT) and cross-modal knowledge distillation modules, including feature similarity distillation (FSD), TPV distillation (TPVD), and prediction alignment distillation (PAD), our method substantially reduces the computational burden while maintaining high accuracy. Experimental evaluations demonstrate that our method surpasses current state-of-the-art vision-based SSC methods in accuracy on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks. Additionally, our method is more lightweight, reducing both memory consumption and inference time by over 23% compared to the current state-of-the-art method. Code is available at our project page: https://studyingfufu.github.io/L2COcc/.

Paper Structure

This paper contains 20 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Top: An overview of our framework. Bottom: Performance of our approach on the SemanticKITTI dataset [behley2019semantickitti]. Our proposed L2COcc is a TPV-based lightweight camera-centric SSC framework. L2COcc-C is trained using only camera data, whereas L2COcc-D is distilled from our LiDAR model but operates solely on camera input during inference. Our approach outperforms the representative methods CGFormer [yu2024context] and StereoScene [li2023stereoscene] in accuracy, memory consumption, and efficiency.
  • Figure 2: The overall framework of our proposed L2COcc, comprised of three stages: voxel feature generation (indicated by the gray background), TPV-based occupancy prediction network (indicated by the blue background), and cross-modal knowledge distillation (indicated by the green background). The voxel features generated by the camera and LiDAR branches serve as inputs to the TPV-based occupancy prediction network. Meanwhile, cross-modal knowledge distillation transfers knowledge from the LiDAR-based teacher to the camera-based student, further improving the performance of the camera-only model during inference.
  • Figure 3: Qualitative visualizations on SemanticKITTI and SSCBench-KITTI-360. With cross-modal knowledge distillation, L2COcc-D achieves better results than L2COcc-C, particularly in regions farther from the camera and at the boundaries of the camera's field of view.
  • Figure 4: PCA visualization of the TPV features $\boldsymbol{F^{Enc}}$. The features $\boldsymbol{F^{Enc}}$ obtained from LiDAR exhibit a distinct, well-defined scene geometry, whereas those obtained from the camera appear blurred and radial. Introducing knowledge distillation mitigates this discrepancy. More qualitative results are provided on our project page: https://studyingfufu.github.io/L2COcc/.
  • Figure 5: Visualization of the TPV aggregation weights $\boldsymbol{W^{Agg}}$. We average the 3D weights along each dimension to generate three corresponding 2D weight maps. The weights associated with the XY, YZ, and ZX planes of the TPV features are then mapped to the R, G, and B channels, respectively. The maps show that the camera-based model places more emphasis on the YZ plane, whereas the LiDAR-based model assigns more weight to the XY plane. Incorporating TPV aggregator distillation enables the camera model to adopt the weight allocation pattern learned by the LiDAR model.
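
The weight-map construction described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the aggregation weights are stored as an array of shape (3, X, Y, Z) with one per-voxel weight for each of the XY, YZ, and ZX planes, and the function name `tpv_weight_maps` and the dictionary keys are hypothetical. The softmax normalization in the usage lines is likewise only an assumption used to produce plausible example weights.

```python
import numpy as np


def tpv_weight_maps(weights):
    """Collapse per-voxel TPV plane weights into three RGB weight maps.

    Assumed layout (not confirmed by the source): `weights` has shape
    (3, X, Y, Z), with channel order (XY plane, YZ plane, ZX plane).
    """
    weights = np.asarray(weights, dtype=np.float64)
    if weights.ndim != 4 or weights.shape[0] != 3:
        raise ValueError("expected weights of shape (3, X, Y, Z)")

    # Move the plane axis last so XY/YZ/ZX weights become the R/G/B channels.
    rgb_volume = np.moveaxis(weights, 0, -1)  # shape (X, Y, Z, 3)

    # Average the 3D volume along each spatial dimension in turn.
    return {
        "mean_over_z": rgb_volume.mean(axis=2),  # shape (X, Y, 3)
        "mean_over_x": rgb_volume.mean(axis=0),  # shape (Y, Z, 3)
        "mean_over_y": rgb_volume.mean(axis=1),  # shape (X, Z, 3)
    }


# Hypothetical example: random logits normalized across the three planes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5, 6))
example_weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
maps = tpv_weight_maps(example_weights)
```

Under this layout, a pixel that appears predominantly green in any of the three maps would indicate that the YZ-plane weight dominates along the averaged direction, which is how the caption's camera-versus-LiDAR comparison would be read.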