SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection

Haimei Zhao, Qiming Zhang, Shanshan Zhao, Zhe Chen, Jing Zhang, Dacheng Tao

TL;DR

SimDistill addresses camera-only BEV 3D object detection with a simulated multi-modal distillation framework that pairs a LiDAR-camera fusion-based teacher with a nearly identical two-branch student that takes only multi-view images as input. The core idea is to perform intra-modal, cross-modal, and multi-modal fusion distillation within BEV space, with a Geometry Compensation Module bridging the geometric gap between modalities. SimDistill improves over the camera-only baseline BEVFusion-C by 4.8% mAP and 4.1% NDS on nuScenes, and it performs well across ablations and backbone variants. The approach offers a practical path to camera-only deployment that still benefits from LiDAR-like multi-modal knowledge, and it may apply to other teacher-student configurations and to future multi-modal distillation research.

Abstract

Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging and may lead to inferior performance. Although distilling precise 3D geometry knowledge from LiDAR data could help tackle this challenge, the benefits of LiDAR information could be greatly hindered by the significant modality gap between different sensory modalities. To address this issue, we propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy. Specifically, we devise multi-modal architectures for both teacher and student models, including a LiDAR-camera fusion-based teacher and a simulated fusion-based student. Owing to the "identical" architecture design, the student can mimic the teacher to generate multi-modal features with merely multi-view images as input, where a geometry compensation module is introduced to bridge the modality gap. Furthermore, we propose a comprehensive multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal fusion distillation simultaneously in the Bird's-eye-view space. Incorporating them together, our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment. Extensive experiments validate the effectiveness and superiority of SimDistill over state-of-the-art methods, achieving an improvement of 4.8% mAP and 4.1% NDS over the baseline detector. The source code will be released at https://github.com/ViTAE-Transformer/SimDistill.

Paper Structure (39 sections, 9 equations, 8 figures, 11 tables)


Figures (8)

  • Figure 1: Comparison of our SimDistill with previous distillation frameworks. (a) Intra-modal distillation between camera-only teacher and student models cannot learn accurate 3D information due to the limited capacity of the teacher model for inferring 3D geometry. (b) Cross-modal distillation between the LiDAR teacher and Camera student enables learning useful 3D information from the teacher but suffers from the large cross-modal gap. (c) Our simulated multi-modal distillation enables effective knowledge distillation within/between modalities and fully takes advantage of complementary information from different modalities.
  • Figure 2: Overall pipeline of SimDistill. It consists of a fusion-based teacher model (top) and a simulated multi-modal student model (bottom). SimDistill supports (1) Intra-Modal Distillation (IMD) between the camera features of the teacher and student; (2) Cross-Modal Distillation (CMD) between the teacher's LiDAR feature and the student's Simulated-LiDAR feature; and (3) Multi-Modal fusion Distillation (MMD) between the fusion features (MMD-F) and predictions (MMD-P) of the teacher and student. The workflows of the (simulated) LiDAR and camera branches are denoted by red and blue arrows, respectively.
  • Figure 3: Illustration of Geometry Compensation Module (GCM). The colorful voxels denote learned features of the target object. Best viewed with zoom-in.
  • Figure 4: Comparison of the overall pipeline of BEVFusion-C and our SimDistill (student model). Both models take multi-view images as input, while our method extends into two branches to additionally learn simulated LiDAR features and conduct multi-modal knowledge distillation.
  • Figure 5: Visualization of the detection results inferred by SimDistill, on LiDAR top view. The green and red boxes represent the prediction and ground truth, respectively.
  • ...and 3 more figures
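
The four distillation paths named in the Figure 2 caption (IMD, CMD, MMD-F, MMD-P) can be illustrated with a small sketch. This is not the authors' implementation: the class and field names, the use of plain mean squared error for every term, and the per-term weights are all assumptions made here for illustration, and the summary above does not state the actual loss functions. Feature maps are represented as flattened lists of floats to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence

# Hypothetical representation: a BEV feature map flattened to a list of floats.
Feature = Sequence[float]


def mse(student: Feature, teacher: Feature) -> float:
    """Mean squared error between two equally sized flattened feature maps."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher features must have the same size")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


@dataclass
class ModelOutputs:
    """Outputs of one model. Field names are illustrative, not from the paper."""
    camera_bev: Feature  # camera-branch BEV feature
    lidar_bev: Feature   # LiDAR feature (teacher) or simulated-LiDAR feature (student)
    fused_bev: Feature   # feature after multi-modal fusion
    prediction: Feature  # detection-head output


def distillation_terms(student: ModelOutputs, teacher: ModelOutputs) -> Dict[str, float]:
    """One term per distillation path in the Figure 2 caption.

    Assumption: MSE stands in for whatever loss each path really uses.
    """
    return {
        "IMD": mse(student.camera_bev, teacher.camera_bev),    # intra-modal
        "CMD": mse(student.lidar_bev, teacher.lidar_bev),      # cross-modal
        "MMD-F": mse(student.fused_bev, teacher.fused_bev),    # fusion features
        "MMD-P": mse(student.prediction, teacher.prediction),  # predictions
    }


def total_distillation_loss(
    terms: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of the terms; the weights are hypothetical hyperparameters."""
    if weights is None:
        weights = {name: 1.0 for name in terms}
    return sum(weights[name] * value for name, value in terms.items())


# Toy usage with made-up numbers.
teacher = ModelOutputs([1.0, 2.0], [0.0, 0.0], [1.0, 1.0], [0.5])
student = ModelOutputs([1.0, 2.0], [1.0, 1.0], [1.0, 3.0], [0.5])
terms = distillation_terms(student, teacher)
loss = total_distillation_loss(terms)
```

In this toy setup the student already matches the teacher on the camera feature and the prediction, so only the CMD and MMD-F terms contribute. In a real training loop the teacher outputs would be held fixed and the combined distillation loss would be added to the student's detection loss.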