PromptDet: A Lightweight 3D Object Detection Framework with LiDAR Prompts

Kun Guo, Qiang Ling

TL;DR

PromptDet addresses the efficiency gap in multi-modal 3D object detection by introducing a LiDAR-assisted prompter that can fuse LiDAR with camera features at multiple scales and transfer the fused knowledge to the camera detector in a single-stage training regime. The Adaptive Hierarchical Aggregation (AHA) fuses multi-scale LiDAR and image features into a unified BEV representation, while Cross-Modal Knowledge Injection (CMKI) transfers this fused knowledge to the camera branch using an imitation module and dedicated distillation losses. On nuScenes, PromptDet yields substantial gains with LiDAR input (up to 22.8% mAP, 21.1% NDS) and notable improvements even in camera-only inference (up to 2.4% mAP, 4.0% NDS) with minimal parameter overhead. This approach offers a deployment-friendly, single-stage training pathway for robust multi-modal 3D perception and can be extended to other multi-camera perception tasks.

Abstract

Multi-camera 3D object detection aims to detect and localize objects in 3D space using multiple cameras, and it has attracted growing attention because of its cost-effectiveness. However, camera-based methods often struggle with inaccurate depth estimation, a consequence of the camera's inherent weakness in ranging. Multi-modal fusion and knowledge distillation methods have recently been proposed to address this problem, but they are time-consuming to train and memory-intensive. In light of this, we propose PromptDet, a lightweight yet effective 3D object detection framework motivated by the success of prompt learning in 2D foundation models. PromptDet comprises two components: a general camera-based detection module, exemplified by models such as BEVDet and BEVDepth, and a LiDAR-assisted prompter. The LiDAR-assisted prompter uses LiDAR points as a complementary signal and adds only a minimal set of trainable parameters. Thanks to this prompt-like design, the framework is flexible: it can serve as a lightweight multi-modal fusion method, or as a camera-only method for 3D object detection at inference time. Extensive experiments on nuScenes validate the effectiveness of PromptDet. As a multi-modal detector, PromptDet improves mAP and NDS by up to 22.8% and 21.1%, respectively, with fewer than 2% extra parameters compared with the camera-only baseline. Without LiDAR points, PromptDet still improves mAP by up to 2.4% and NDS by up to 4.0% with almost no impact on camera detection inference time.

Paper Structure

This paper contains 31 sections, 8 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison of our PromptDet with previous detection frameworks. (a) Multi-modal detection needs a more complex network architecture, and model training is time-consuming and memory-intensive. (b) Although knowledge distillation brings performance gains to camera-only detection, a teacher model must be trained first, which makes the whole process laborious. (c) Our method uses the LiDAR modality as a flexible prompt with a few additional parameters. PromptDet can perform both multi-modal detection and camera-only detection with better performance than the baseline.
  • Figure 2: Overview of our proposed PromptDet. The model is composed of a camera-only detector and the LiDAR-assisted prompter, which includes Adaptive Hierarchical Aggregation (AHA) and Cross-Modal Knowledge Injection (CMKI). During model training, the LiDAR modality switch is turned on, so multi-modal fusion and online knowledge transfer are performed at the same time. At inference, PromptDet supports both LiDAR-camera detection and camera-only detection.
  • Figure 3: Illustration of Adaptive Hierarchical Aggregation (AHA). LiDAR voxel features and camera pseudo voxel features are fused at different voxel scales, as shown in (a). Hierarchical fusion features are aggregated in a flexible way to obtain the output BEV features, as shown in (b).
  • Figure 4: Illustration of data distribution change during PromptDet training. LiDAR-camera detection and camera-only detection are both supervised by the ground truth. If fusion features are not detached in Cross-Modal Knowledge Injection (CMKI), they are also supervised by camera-only ones, which leads to inferior detection performance.
  • Figure 5: Illustration of the effect of Cross-Modal Knowledge Injection (CMKI). With CMKI active, the data distribution difference between fusion features and camera features narrows and all three kinds of CMKI loss decrease slowly.
  • ...and 3 more figures
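
The Figure 4 caption notes that fusion features should be detached in CMKI so that the camera branch imitates them without pulling them toward the camera features. The sketch below illustrates that idea only. It assumes a plain mean-squared-error imitation loss on flat feature vectors and computes the gradients by hand; the function name `imitation_loss_and_grads` and the `detach_fusion` flag are hypothetical, and the loss form is an assumption rather than any of the paper's three CMKI losses.

```python
from typing import List, Tuple


def imitation_loss_and_grads(
    camera_feat: List[float],
    fusion_feat: List[float],
    detach_fusion: bool = True,
) -> Tuple[float, List[float], List[float]]:
    """MSE imitation loss with analytic gradients (illustrative assumption).

    Returns (loss, grad w.r.t. camera features, grad w.r.t. fusion features).
    With detach_fusion=True the fusion features act as a fixed target,
    so no gradient flows back into them.
    """
    n = len(camera_feat)
    diffs = [c - f for c, f in zip(camera_feat, fusion_feat)]
    loss = sum(d * d for d in diffs) / n

    # d(loss)/d(camera_i) = 2 * (camera_i - fusion_i) / n
    grad_camera = [2.0 * d / n for d in diffs]

    if detach_fusion:
        # Stop-gradient: the fusion branch is not updated by this loss.
        grad_fusion = [0.0] * n
    else:
        # Without detaching, the fusion features are also pulled
        # toward the camera features.
        grad_fusion = [-2.0 * d / n for d in diffs]

    return loss, grad_camera, grad_fusion


if __name__ == "__main__":
    camera = [1.0, 3.0]
    fusion = [2.0, 1.0]
    print(imitation_loss_and_grads(camera, fusion, detach_fusion=True))
    print(imitation_loss_and_grads(camera, fusion, detach_fusion=False))
```

In an autograd framework the same effect would typically come from a stop-gradient operation on the fusion features. The manual gradients here only make visible which branch receives an update in each case.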