Efficient Multimodal 3D Object Detector via Instance-Level Contrastive Distillation

Zhuoqun Su, Huimin Lu, Shuaifeng Jiao, Junhao Xiao, Yaonan Wang, Xieyuanli Chen

TL;DR

This work tackles cross-modal heterogeneity in multimodal 3D object detection by introducing Instance-level Contrastive Distillation (ICD), which transfers spatial knowledge from a frozen LiDAR teacher to the RGB image encoder using object-aware, instance-level contrastive learning. It also presents the Cross Linear Attention Fusion Module (CLFM), a scalable fusion mechanism with linear complexity that enables bidirectional, global cross-modal interactions in BEV space. Together, ICD and CLFM yield state-of-the-art performance on KITTI multiclass 3D detection while maintaining online inference speeds of around 14 FPS, and they generalize to nuScenes. The approach combines a teacher-student framework, targeted instance-level supervision, and kernel-based attention fusion to balance convergence across modalities and efficiently capture long-range dependencies in multimodal BEV features.
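To make the idea of instance-level contrastive distillation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the InfoNCE form of the loss, the cosine similarity, the temperature value, and the function name `instance_contrastive_loss` are all assumptions chosen for illustration. Each student (image) instance feature is pulled toward the frozen teacher (LiDAR) feature of the same object and pushed away from the teacher features of the other objects.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def instance_contrastive_loss(student_feats, teacher_feats, temperature=0.1):
    """InfoNCE-style loss over N matched instances (assumed form).

    student_feats[i] and teacher_feats[i] describe the same object;
    every other teacher feature acts as a negative for student i.
    The teacher is treated as fixed: it only supplies targets.
    """
    if not student_feats or len(student_feats) != len(teacher_feats):
        raise ValueError("need the same non-zero number of instances")
    s = [l2_normalize(v) for v in student_feats]
    t = [l2_normalize(v) for v in teacher_feats]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of student i to every teacher instance.
        logits = [dot(s_i, t_j) / temperature for t_j in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching instance as the positive.
        total += log_denom - logits[i]
    return total / len(s)


# Toy example: two objects with 2-D features.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[0.9, 0.1], [0.2, 0.8]]
swapped_student = [[0.2, 0.8], [0.9, 0.1]]
loss_aligned = instance_contrastive_loss(aligned_student, teacher)
loss_swapped = instance_contrastive_loss(swapped_student, teacher)
```

In this toy setup the student whose features match the teacher instance by instance receives the lower loss, which is the behavior a distillation objective of this kind is meant to encourage.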

Abstract

Multimodal 3D object detectors leverage the strengths of both geometry-aware LiDAR point clouds and semantically rich RGB images to enhance detection performance. However, the inherent heterogeneity between these modalities, including unbalanced convergence and modal misalignment, poses significant challenges. Meanwhile, the large size of detection-oriented features limits the ability of existing fusion strategies to capture long-range dependencies for 3D detection tasks. In this work, we introduce a fast yet effective multimodal 3D object detector, incorporating our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained cross-modal consistency. Meanwhile, CLFM presents an efficient and scalable fusion strategy that enhances cross-modal global interactions within sizable multimodal BEV features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks demonstrate the effectiveness of our methods. Notably, our 3D object detector outperforms state-of-the-art (SOTA) methods while achieving superior efficiency. The implementation of our method has been released as open source at: https://github.com/nubot-nudt/ICD-Fusion.
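The claim that cross-modal attention can scale linearly with the number of BEV cells can be illustrated with a small sketch of kernel-based (linear) attention. This is a hedged illustration, not the CLFM architecture itself: the feature map `phi(x) = elu(x) + 1`, the absence of learned projections, the residual-plus-concatenation fusion, and the names `linear_cross_attention` and `bidirectional_fusion` are assumptions. The key step is to summarize keys and values once, so each query costs O(d * c) instead of attending to every key separately.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed kernel choice)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_cross_attention(queries, keys, values):
    """Kernel attention: queries N x d, keys M x d, values M x c.

    Cost is O((N + M) * d * c) rather than O(N * M * d), because the
    keys and values are reduced to a d x c summary before any query
    is processed.
    """
    d, c = len(keys[0]), len(values[0])
    kv = [[0.0] * c for _ in range(d)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d                   # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = [phi(x) for x in k]
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(c):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = [phi(x) for x in q]
        # Normalizer; strictly positive because phi is positive.
        z = sum(fq[a] * k_sum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / z
                    for b in range(c)])
    return out


def bidirectional_fusion(lidar_bev, image_bev):
    """Each modality queries the other; results are concatenated per cell.

    Both inputs are lists of N flattened BEV cells with the same channel
    count (an assumption made so the residual additions line up).
    """
    lidar_ctx = linear_cross_attention(lidar_bev, image_bev, image_bev)
    image_ctx = linear_cross_attention(image_bev, lidar_bev, lidar_bev)
    fused = []
    for l, lc, i, ic in zip(lidar_bev, lidar_ctx, image_bev, image_ctx):
        fused.append([a + b for a, b in zip(l, lc)] +
                     [a + b for a, b in zip(i, ic)])
    return fused


# Toy example: three BEV cells with two channels per modality.
lidar = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8]]
image = [[0.4, 0.4], [-0.6, 0.1], [0.9, -0.2]]
fused = bidirectional_fusion(lidar, image)
```

Because every attention weight is positive and normalized, each attended output is a convex combination of the value vectors, as in softmax attention, but the full N x M attention matrix is never materialized.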

Paper Structure

This paper contains 17 sections, 15 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a) During the early stages of training, the dual-branch encoder tends to over-rely on LiDAR features due to their faster convergence, leading to severe trailing effects in the RGB-derived BEV feature. (b) LiDAR-based augmentation techniques cannot easily be applied to a dual-branch encoder, as ensuring spatial and semantic consistency in RGB images remains challenging when attempting equivalent transformations.
  • Figure 2: Overall architecture of our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). Firstly, a LiDAR-only teacher network is pretrained and its weights are frozen. Afterwards, we train the multimodal student with ICD and CLFM. The encoded features from the 3D branch and 2D branch are fully fused via our proposed CLFM, enabling high-performance online 3D object detection.
  • Figure 3: (a) CutMix and GT sampling ensure multimodal alignment at the input stage while preserving 2D perspective occlusion. (b) Rotated context anchors query image BEV features, enabling soft alignment of 3D rotations through instance-level feature distillation.
  • Figure 4: The architecture of the Cross Linear Attention Fusion Module (CLFM).
  • Figure 5: Qualitative results comparing our method with SOTA approaches on the multi-class KITTI 3D object detection task. Green Box represents the ground-truth bounding boxes; Red Box represents each method's predictions; Green Circle represents false-negative objects; Red Circle represents false-positive predictions; Blue Anchor marks enlarged views that display tiny targets such as pedestrians. Our method demonstrates strong detection performance on small targets while exhibiting robustness against false negatives and false positives.
  • ...and 1 more figure