LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Sanmin Kim, Youngseok Kim, Sihwan Hwang, Hyeonjun Jeong, Dongsuk Kum

TL;DR

LabelDistill tackles the gap between camera-based and LiDAR-based 3D detection by introducing a label-guided cross-modal knowledge distillation framework. It leverages aleatoric-uncertainty-free label features embedded into the LiDAR teacher's feature space via an approximate inverse of the LiDAR head $h^{-1}$ and employs feature partitioning to preserve the student’s own semantic features while learning from both LiDAR and label cues. The method couples LiDAR feature-level and response-level distillation with a dedicated label distillation path, optimized under a joint loss $\mathcal{L} = \mathcal{L}_{det} + \lambda_{1}\mathcal{L}_{lidar}^{feat} + \lambda_{2}\mathcal{L}_{label}^{feat} + \lambda_{3}\mathcal{L}_{lidar}^{resp}$. Extensive experiments on nuScenes show substantial improvements over baselines and prior LiDAR-guided KD approaches, while ablations validate the contribution of each component and the dependence on high-quality ground-truth labels. The approach maintains inference efficiency and highlights practical impact for robust camera-based 3D detection, with scope for further gains as label quality and encoding are refined.
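
The joint loss above is a weighted sum of four terms, which can be sketched as follows. This is only an illustration: the function name, the argument names, and the default weights of 1.0 are assumptions, not values taken from the paper, and the individual loss terms are assumed to be computed elsewhere as plain floats.

```python
def joint_loss(l_det: float,
               l_lidar_feat: float,
               l_label_feat: float,
               l_lidar_resp: float,
               lam1: float = 1.0,
               lam2: float = 1.0,
               lam3: float = 1.0) -> float:
    """Combine the detection loss with the three distillation losses.

    Mirrors L = L_det + lam1 * L_lidar^feat + lam2 * L_label^feat
                + lam3 * L_lidar^resp.
    The default weights are placeholders (assumption), not the paper's values.
    """
    return (l_det
            + lam1 * l_lidar_feat
            + lam2 * l_label_feat
            + lam3 * l_lidar_resp)


if __name__ == "__main__":
    # Hypothetical per-batch loss values, for illustration only.
    print(joint_loss(1.0, 2.0, 3.0, 4.0, lam1=0.5, lam2=0.25, lam3=0.1))
```

Setting a weight to zero removes the corresponding distillation path, which is one way to reproduce the kind of component ablation the summary mentions.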

Abstract

Recent advancements in camera-based 3D object detection have introduced cross-modal knowledge distillation to bridge the performance gap with LiDAR 3D detectors, leveraging the precise geometric information in LiDAR point clouds. However, existing cross-modal knowledge distillation methods tend to overlook the inherent imperfections of LiDAR, such as the ambiguity of measurements on distant or occluded objects, which should not be transferred to the image detector. To mitigate these imperfections in the LiDAR teacher, we propose a novel method that leverages aleatoric uncertainty-free features from ground truth labels. In contrast to conventional label guidance approaches, we approximate the inverse function of the teacher's head to effectively embed label inputs into the feature space. This approach provides additional accurate guidance alongside the LiDAR teacher, thereby boosting the performance of the image detector. Additionally, we introduce feature partitioning, which effectively transfers knowledge from the teacher modality while preserving the distinctive features of the student, thereby maximizing the potential of both modalities. Experimental results show that our approach improves mAP and NDS by 5.1 points and 4.9 points, respectively, compared to the baseline model, demonstrating its effectiveness. The code is available at https://github.com/sanmin0312/LabelDistill

Paper Structure

This paper contains 13 sections, 7 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: (a) Conventional cross-modal knowledge distillation trains an image detector to mimic the features of a well-trained LiDAR detector. It could be suboptimal as it directly transfers LiDAR features with inherent imperfections to the image feature. (b) LabelDistill enhances the image detector by incorporating ground truth labels into the feature representation. This approach aims to furnish the image detector with more accurate guidance, alleviating the intrinsic limitations of LiDAR point clouds.
  • Figure 2: Overall architecture of the proposed method. Our model is trained with two distillation strategies: LiDAR distillation and label distillation. LiDAR distillation transfers abundant spatial information to the image detector using feature-level and response-level distillation. Label distillation provides accurate, aleatoric uncertainty-free information based on the ground truth labels to compensate for the limitations of LiDAR point clouds. In addition, feature partitioning separates the image features into three groups to preserve distinctive image features while learning from LiDAR and label features.
  • Figure 3: Architecture of the label encoder. The label encoder is designed to approximate the inverse function of the pretrained LiDAR detection head. It first encodes class and bounding box information; the mapping function then transforms the encoded label features into BEV space by filling each object's bounding box area with its label features. Finally, the convolutional block encodes the BEV label features.
  • Figure 4: Illustration of BEV feature maps in the inference stage. $F_{image}^{image}$ is the undistilled image feature, $F_{image}^{lidar}$ is the LiDAR-distilled image feature, $F_{image}^{label}$ is the label-distilled image feature, and $F_{label}$ is the label feature from the label encoder.
  • Figure 5: Comparison of the baseline (BEVDepth) and our approach. The blue circles in the BEV view highlight cases that demonstrate the advantages of our approach, including: 1) higher recall, 2) more accurate localization, and 3) fewer false positives.
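
Two mechanisms described in the captions for Figures 2 to 4 can be sketched in a few lines: the label encoder's mapping step, which fills each object's bounding box area in BEV space with that object's label feature, and feature partitioning, which splits the image feature into an undistilled group, a LiDAR-distilled group, and a label-distilled group. The sketch below is a simplified illustration, not the authors' implementation. The names `fill_bev` and `partition_channels` are hypothetical, boxes are assumed to be axis-aligned grid rectangles (real 3D boxes also have yaw), and the channel-wise split with caller-chosen group sizes is an assumption about how the three groups are formed.

```python
from typing import List, Sequence, Tuple

Grid = List[List[List[float]]]  # BEV feature map indexed as [row][col][channel]


def fill_bev(height: int, width: int,
             boxes: Sequence[Tuple[int, int, int, int]],
             feats: Sequence[Sequence[float]]) -> Grid:
    """Write each object's label feature into every BEV cell its box covers.

    Boxes are (row0, col0, row1, col1), half-open and axis-aligned
    (simplifying assumption). Cells outside all boxes stay zero.
    Later boxes overwrite earlier ones where they overlap.
    """
    channels = len(feats[0]) if feats else 0
    grid = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for (r0, c0, r1, c1), feat in zip(boxes, feats):
        # Clip the box to the grid so out-of-range boxes are handled safely.
        for r in range(max(r0, 0), min(r1, height)):
            for c in range(max(c0, 0), min(c1, width)):
                grid[r][c] = list(feat)
    return grid


def partition_channels(feature: Sequence[float], n_lidar: int, n_label: int
                       ) -> Tuple[List[float], List[float], List[float]]:
    """Split one cell's channel vector into (image, lidar, label) groups.

    The first group is left undistilled; the other two would be matched
    against LiDAR and label features during training (assumed layout).
    """
    n_image = len(feature) - n_lidar - n_label
    if n_image < 0 or n_lidar < 0 or n_label < 0:
        raise ValueError("group sizes do not fit the channel count")
    return (list(feature[:n_image]),
            list(feature[n_image:n_image + n_lidar]),
            list(feature[n_image + n_lidar:]))


if __name__ == "__main__":
    bev = fill_bev(3, 4, boxes=[(0, 1, 2, 3)], feats=[[1.0, 2.0]])
    print(bev[0][1], bev[2][1])
    print(partition_channels([1, 2, 3, 4, 5, 6], n_lidar=2, n_label=1))
```

In the method as described, the filled BEV map would then pass through a convolutional block, and only the LiDAR and label groups would receive distillation losses, leaving the first group free to keep image-specific features.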