LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection
Sanmin Kim, Youngseok Kim, Sihwan Hwang, Hyeonjun Jeong, Dongsuk Kum
TL;DR
LabelDistill narrows the gap between camera-based and LiDAR-based 3D detection with a label-guided cross-modal knowledge distillation framework. Ground-truth labels, which are free of aleatoric uncertainty, are embedded into the LiDAR teacher's feature space through an approximate inverse $h^{-1}$ of the LiDAR head, and feature partitioning lets the student keep its own semantic features while learning from both LiDAR and label cues. The method couples LiDAR feature-level and response-level distillation with a dedicated label distillation path, optimized under the joint loss $\mathcal{L} = \mathcal{L}_{det} + \lambda_{1}\mathcal{L}_{lidar}^{feat} + \lambda_{2}\mathcal{L}_{label}^{feat} + \lambda_{3}\mathcal{L}_{lidar}^{resp}$. On nuScenes, the approach improves mAP and NDS by 5.1 and 4.9 points over the baseline and also outperforms prior LiDAR-guided KD approaches; ablations validate the contribution of each component and show that the gains depend on high-quality ground-truth labels. The approach maintains inference efficiency, which makes it practical for robust camera-based 3D detection, and leaves room for further gains as label quality and encoding are refined.
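As a reading aid, the joint loss above can be sketched as a plain weighted sum. This is a minimal illustration, not the authors' implementation: the names `DistillWeights` and `joint_loss` are hypothetical, the individual loss terms are assumed to be already-computed scalars, and the default weights of 1.0 are placeholders because the $\lambda$ values are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillWeights:
    """Hypothetical container for lambda_1..lambda_3 (values are placeholders)."""
    lam_lidar_feat: float = 1.0  # lambda_1, weight on LiDAR feature-level distillation
    lam_label_feat: float = 1.0  # lambda_2, weight on label feature distillation
    lam_lidar_resp: float = 1.0  # lambda_3, weight on LiDAR response-level distillation


def joint_loss(l_det: float,
               l_lidar_feat: float,
               l_label_feat: float,
               l_lidar_resp: float,
               w: DistillWeights = DistillWeights()) -> float:
    """Combine the detection loss with the three distillation terms."""
    return (l_det
            + w.lam_lidar_feat * l_lidar_feat
            + w.lam_label_feat * l_label_feat
            + w.lam_lidar_resp * l_lidar_resp)


# Example with made-up scalar loss values and made-up weights.
total = joint_loss(1.0, 2.0, 3.0, 4.0, DistillWeights(0.5, 0.25, 0.1))
```

Setting a weight to zero removes the corresponding distillation path, which is one simple way to mimic a component ablation in this sketch.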
Abstract
Recent advancements in camera-based 3D object detection have introduced cross-modal knowledge distillation to bridge the performance gap with LiDAR 3D detectors, leveraging the precise geometric information in LiDAR point clouds. However, existing cross-modal knowledge distillation methods tend to overlook the inherent imperfections of LiDAR, such as the ambiguity of measurements on distant or occluded objects, which should not be transferred to the image detector. To mitigate these imperfections in the LiDAR teacher, we propose a novel method that leverages aleatoric uncertainty-free features from ground truth labels. In contrast to conventional label guidance approaches, we approximate the inverse function of the teacher's head to effectively embed label inputs into feature space. This approach provides additional accurate guidance alongside the LiDAR teacher, thereby boosting the performance of the image detector. Additionally, we introduce feature partitioning, which effectively transfers knowledge from the teacher modality while preserving the distinctive features of the student, thereby maximizing the potential of both modalities. Experimental results show that our approach improves mAP and NDS by 5.1 points and 4.9 points, respectively, over the baseline model, demonstrating its effectiveness. The code is available at https://github.com/sanmin0312/LabelDistill
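The feature partitioning idea can be pictured with a toy example. The sketch below is an assumption-heavy illustration rather than the method as implemented: it assumes the student feature is a flat list of channel values split into three contiguous groups (one pulled toward the LiDAR teacher feature, one toward the label feature, and one left free for the student's own cues), and it uses mean squared error as a stand-in distillation loss. The function names and the split sizes `n_lidar` and `n_label` are hypothetical.

```python
from typing import List, Tuple


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b), "vectors must have the same length"
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def partitioned_distill_losses(student: List[float],
                               lidar_feat: List[float],
                               label_feat: List[float],
                               n_lidar: int,
                               n_label: int) -> Tuple[float, float, int]:
    """Split student channels into groups and distill each group separately.

    Returns (LiDAR feature loss, label feature loss, number of free channels).
    The free channels receive no distillation signal in this sketch.
    """
    lidar_part = student[:n_lidar]                    # assumed LiDAR-guided group
    label_part = student[n_lidar:n_lidar + n_label]   # assumed label-guided group
    free_part = student[n_lidar + n_label:]           # assumed student-only group
    return mse(lidar_part, lidar_feat), mse(label_part, label_feat), len(free_part)


# Toy six-channel student feature with two channels per group.
losses = partitioned_distill_losses(
    student=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    lidar_feat=[1.0, 0.0],
    label_feat=[3.0, 2.0],
    n_lidar=2,
    n_label=2,
)
```

Because each teacher signal touches only its own channel group in this sketch, the remaining channels are shaped by the detection loss alone, which is one way to read "preserving the distinctive features of the student".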
