Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection

Rui Ding, Meng Yang, Nanning Zheng

TL;DR

This paper is the first to systematically investigate the negative transfer problem induced by the modality gap in cross-modality distillation, covering not only the architecture inconsistency issue but, more importantly, the feature overfitting issue. It proposes a selective learning approach named MonoSTL to overcome these issues.

Abstract

Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation can effectively transfer depth information from LiDAR to an image-based network. However, the modality gap between image and LiDAR seriously limits its accuracy. In this paper, we systematically investigate the negative transfer problem induced by the modality gap in cross-modality distillation for the first time, including not only the architecture inconsistency issue but, more importantly, the feature overfitting issue. We propose a selective learning approach named MonoSTL to overcome these issues, which encourages positive transfer of depth information from LiDAR while alleviating negative transfer to the image-based network. On the one hand, we use similar architectures to ensure spatial alignment of features between the image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillation, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models; we validate it with three recent models on KITTI and one recent model on NuScenes. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy compared with all recently released SOTA models.
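
The idea of integrating depth uncertainty into feature distillation can be illustrated with a small sketch. This is not the paper's DASFD formulation, which is not given in this section. It assumes a hypothetical weighting `exp(-uncertainty)` per spatial location, so that locations where the depth estimate is uncertain contribute less to the feature-matching loss. The function name, the weighting scheme, and the input layout are all assumptions made for illustration.

```python
import math
from typing import Sequence


def selective_feature_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    uncertainty: Sequence[float],
) -> float:
    """Uncertainty-weighted mean squared error between feature vectors.

    student, teacher: one feature vector per spatial location.
    uncertainty: one non-negative depth-uncertainty value per location.
    The weight exp(-uncertainty) is an illustrative assumption, not the
    weighting used in the paper.
    """
    if not (len(student) == len(teacher) == len(uncertainty)):
        raise ValueError("inputs must cover the same number of locations")

    weighted_error = 0.0
    weight_total = 0.0
    for s_vec, t_vec, sigma in zip(student, teacher, uncertainty):
        # Confident locations (low sigma) get a weight close to 1.
        weight = math.exp(-sigma)
        # Mean squared difference over the channels of this location.
        sq_err = sum((s - t) ** 2 for s, t in zip(s_vec, t_vec)) / len(s_vec)
        weighted_error += weight * sq_err
        weight_total += weight

    # Normalise by the total weight; an empty input yields zero loss.
    return weighted_error / weight_total if weight_total > 0 else 0.0


if __name__ == "__main__":
    student_feats = [[1.0, 1.0], [0.0, 0.0]]
    teacher_feats = [[0.0, 0.0], [0.0, 0.0]]
    # Same mismatch, but the second call marks the mismatched location as uncertain.
    print(selective_feature_loss(student_feats, teacher_feats, [0.0, 0.0]))
    print(selective_feature_loss(student_feats, teacher_feats, [10.0, 0.0]))
```

With equal uncertainties the loss reduces to a plain mean over locations. Raising the uncertainty at a mismatched location shrinks its contribution, which is the selective behaviour the abstract describes at a high level.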

Paper Structure (18 sections, 9 equations, 6 figures, 10 tables, 2 algorithms)

Figures (6)

  • Figure 1: Negative transfer problem in cross-modality distillation. (a) A visual example of positive/negative transfer. (b) The accuracy of monocular 3D detection decreases seriously when all features are transferred from the teacher to a student with a similar architecture based on MonoDLE*. Our approach uses selective transfer learning to alleviate the negative transfer problem and considerably improves the accuracy of four base models on the (c) KITTI and (d) NuScenes datasets.
  • Figure 2: Overview of our MonoSTL framework. The framework comprises three components: the teacher network, the student network, and three distillation modules. First, we train the teacher network with the GT depth map from LiDAR. It adopts an architecture similar to that of the student network. Second, three distillation modules selectively transfer features from the teacher network to the student network: our DASRD and DASFD modules and the general response distillation module. Finally, only the student network is retained to predict 3D objects from single images in the inference stage.
  • Figure 3: Visual results in BEV view. The first row in each example displays 3D boxes detected by our approach on images. The second row compares our approach with the base model MonoDLE* in BEV view. The third row compares our approach with the recent Monodistill using the same base model in BEV view. The improvement of our approach is mainly attributed to the developed DASFD and DASRD modules.
  • Figure 4: Visual results in lateral view. The first and second columns in each example display 3D boxes detected by our approach and Monodistill, respectively. Our approach performs better than Monodistill in cases of false positives and false negatives.
  • Figure 5: Failure examples. A few objects are missed or detected inaccurately by our approach. These failures are mainly caused by inaccurate depth estimation of object centers.
  • ...and 1 more figures