
CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation

Lingjun Zhao, Jingyu Song, Katherine A. Skinner

TL;DR

This work proposes Camera-Radar Knowledge Distillation (CRKD), a novel cross-modality knowledge distillation (KD) framework that bridges the performance gap between LiDAR-Camera (LC) and Camera-Radar (CR) detectors, using the Bird's-Eye-View (BEV) representation as the shared feature space to enable effective knowledge distillation.

Abstract

In the field of 3D object detection for autonomous driving, LiDAR-Camera (LC) fusion is the top-performing sensor configuration. Still, LiDAR is relatively costly, which hinders adoption of this technology for consumer automobiles. Alternatively, camera and radar are commonly deployed on vehicles already on the road today, but the performance of Camera-Radar (CR) fusion lags behind that of LC fusion. In this work, we propose Camera-Radar Knowledge Distillation (CRKD) to bridge the performance gap between LC and CR detectors with a novel cross-modality KD framework. We use the Bird's-Eye-View (BEV) representation as the shared feature space to enable effective knowledge distillation. To accommodate the unique cross-modality KD path, we propose four distillation losses to help the student learn crucial features from the teacher model. We present extensive evaluations on the nuScenes dataset to demonstrate the effectiveness of the proposed CRKD framework. The project page for CRKD is https://song-jingyu.github.io/CRKD.
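The shared BEV space is what makes feature-level distillation possible: once teacher and student write features onto the same grid, a cell at the same index refers to the same patch of ground in both maps, so the two can be compared directly. The sketch below is a minimal, plain-Python illustration under that assumption. It uses a generic mean-squared-error feature-imitation loss, which is a stand-in and not one of CRKD's four distillation losses (their exact forms are not given here), and it combines distillation terms with a detection loss through hypothetical weights. The names `bev_feature_kd_loss` and `total_student_loss` are illustrative only.

```python
from typing import List, Sequence

# Hypothetical layout: a BEV feature map as C channels of H x W grids.
BEVMap = List[List[List[float]]]


def _check_same_shape(a: BEVMap, b: BEVMap) -> None:
    """Raise if the two BEV maps do not share the same (C, H, W) shape."""
    if len(a) != len(b):
        raise ValueError("channel count mismatch")
    for a_ch, b_ch in zip(a, b):
        if len(a_ch) != len(b_ch):
            raise ValueError("grid height mismatch")
        for a_row, b_row in zip(a_ch, b_ch):
            if len(a_row) != len(b_row):
                raise ValueError("grid width mismatch")


def bev_feature_kd_loss(student: BEVMap, teacher: BEVMap) -> float:
    """Generic MSE feature-imitation loss between student and teacher BEV maps.

    Only meaningful because both maps live on the same BEV grid, so cell
    (c, i, j) describes the same location in both.
    """
    _check_same_shape(student, teacher)
    total, count = 0.0, 0
    for s_ch, t_ch in zip(student, teacher):
        for s_row, t_row in zip(s_ch, t_ch):
            for s_val, t_val in zip(s_row, t_row):
                total += (s_val - t_val) ** 2
                count += 1
    return total / count if count else 0.0


def total_student_loss(
    det_loss: float, kd_losses: Sequence[float], weights: Sequence[float]
) -> float:
    """Detection loss plus a weighted sum of distillation terms.

    The weights are hypothetical hyperparameters, one per distillation loss.
    """
    if len(kd_losses) != len(weights):
        raise ValueError("need exactly one weight per distillation loss")
    return det_loss + sum(w * l for w, l in zip(weights, kd_losses))


student = [[[1.0, 2.0], [3.0, 4.0]]]
teacher = [[[1.0, 2.0], [3.0, 6.0]]]
kd = bev_feature_kd_loss(student, teacher)  # one cell differs by 2: 2**2 / 4 cells
loss = total_student_loss(det_loss=0.5, kd_losses=[kd], weights=[0.1])
print(kd, loss)
```

In an actual training pipeline these would be tensors, the teacher would typically be frozen, and only the student would run at inference time; plain lists are used here only to keep the sketch dependency-free.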

Paper Structure (34 sections, 9 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: We propose CRKD, a novel cross-modality knowledge distillation path from a LiDAR-camera teacher to a camera-radar student. The radar chart illustrates the complementary nature of these sensing configurations and the improvement that CRKD can enable.
  • Figure 2: An overview of the proposed cross-modality (LC-to-CR) distillation framework, CRKD. We narrow the modality gap by unifying both the teacher and the student in the BEV space with a similar 3D object detector structure. We refine the model design to enable adaptive fusion and design four novel distillation losses for effective cross-modality KD. During inference, only CR input is needed.
  • Figure 3: Qualitative results on nuScenes. Panels b and c show zoomed-in views of the regions highlighted in panel a, with matching dashed borders indicating the correspondence. We show the ground-truth annotations in red, teacher predictions in green, student predictions in yellow, CRKD predictions in blue, and radar points in magenta. In (1a) to (1c), we show an example frame where CRKD makes more accurate predictions and fewer false predictions than the student model. In (2a) to (2c), we show another example frame where CRKD even outperforms the LC teacher by detecting a missed car and rejecting several false predictions. Best viewed on screen and in color.
  • Figure 4: Visualization of the ungated and gated camera feature maps in the teacher detector. The scene geometry is easier to interpret in the gated feature map, as it encodes information from the LiDAR point cloud. Best viewed in color.
  • Figure 5: More qualitative results on nuScenes. Panels b and c show zoomed-in views of the regions highlighted in panel a, with matching dashed borders indicating the correspondence; the highlighted regions are enclosed in dashed ellipses. We show the ground-truth annotations in red, teacher predictions in green, student predictions in yellow, CRKD predictions in blue, and radar points in magenta. In (1a) to (1c), we show an example frame where CRKD, guided by the teacher, captures an object missed by the student. In (2a) to (2c), we show an example frame where CRKD rejects false predictions made by the student model. In (3a) to (3b) and (4a) to (4b), we show two examples where CRKD rejects false predictions made by the teacher model and generates more accurate predictions. In (5a) to (5c), we show an example where CRKD outperforms both the teacher and student models by capturing missed objects and generating fewer false predictions. Best viewed on screen and in color.
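Figure 4 contrasts ungated and gated camera feature maps in the teacher, and Figure 2 mentions adaptive fusion, but the exact gating design is not described on this page. As a rough illustration of the general idea only, the sketch below scales each camera BEV cell by a sigmoid gate computed from another modality's BEV feature at the same cell (for example, the LiDAR feature in the teacher). The scalar `weight` and `bias` are hypothetical stand-ins for a learned layer, and `gate_camera_features` is an illustrative name, not CRKD's implementation.

```python
import math
from typing import List

# A single-channel BEV grid (H x W), kept as nested lists for illustration.
Grid = List[List[float]]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, mapping any real to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so this cannot overflow
    return z / (1.0 + z)


def gate_camera_features(
    camera: Grid, guide: Grid, weight: float = 1.0, bias: float = 0.0
) -> Grid:
    """Scale each camera BEV cell by a gate computed from a guiding feature.

    `guide` is assumed to be another modality's BEV feature on the same grid.
    `weight` and `bias` are hypothetical scalars standing in for a learned layer.
    """
    if len(camera) != len(guide) or any(
        len(c_row) != len(g_row) for c_row, g_row in zip(camera, guide)
    ):
        raise ValueError("camera and guide grids must have the same shape")
    return [
        [c * sigmoid(weight * g + bias) for c, g in zip(cam_row, guide_row)]
        for cam_row, guide_row in zip(camera, guide)
    ]


gated = gate_camera_features([[2.0, 4.0]], [[0.0, 10.0]])
print(gated)  # first cell is halved (gate 0.5); second passes almost unchanged
```

The gate lets the guiding modality suppress or pass camera evidence cell by cell, which is one plausible reading of why the gated map in Figure 4 exposes more scene geometry; a real model would compute the gate with learned convolutions over multi-channel tensors.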