PromptDet: A Lightweight 3D Object Detection Framework with LiDAR Prompts
Kun Guo, Qiang Ling
TL;DR
PromptDet addresses the efficiency gap in multi-modal 3D object detection with a LiDAR-assisted prompter that fuses LiDAR and camera features at multiple scales and transfers the fused knowledge to the camera detector in a single-stage training regime. Adaptive Hierarchical Aggregation (AHA) fuses multi-scale LiDAR and image features into a unified BEV representation, while Cross-Modal Knowledge Injection (CMKI) transfers this fused knowledge to the camera branch using an imitation module and dedicated distillation losses. On nuScenes, PromptDet improves over the camera-only baseline by up to 22.8% mAP and 21.1% NDS with LiDAR input, and by up to 2.4% mAP and 4.0% NDS in camera-only inference, with minimal parameter overhead. The approach offers a deployment-friendly, single-stage training pathway for robust multi-modal 3D perception and can be extended to other multi-camera perception tasks.
Abstract
Multi-camera 3D object detection aims to detect and localize objects in 3D space using multiple cameras, a task that has attracted growing attention because of its cost-effectiveness. However, camera-based methods often struggle with inaccurate depth estimation, an inherent weakness of cameras in ranging. Multi-modal fusion and knowledge distillation methods have recently been proposed to address this problem, but they are time-consuming to train and memory-intensive. In light of this, we propose PromptDet, a lightweight yet effective 3D object detection framework motivated by the success of prompt learning in 2D foundation models. PromptDet comprises two components: a general camera-based detection module, exemplified by models such as BEVDet and BEVDepth, and a LiDAR-assisted prompter. The LiDAR-assisted prompter uses LiDAR points as a complementary signal and adds only a minimal set of trainable parameters. Thanks to this prompt-like design, the framework is flexible: it can serve as a lightweight multi-modal fusion method, or as a camera-only method for 3D object detection at inference time. Extensive experiments on nuScenes validate the effectiveness of PromptDet. As a multi-modal detector, PromptDet improves mAP and NDS by up to 22.8% and 21.1%, respectively, with fewer than 2% extra parameters compared with the camera-only baseline. Without LiDAR points, PromptDet still improves mAP by up to 2.4% and NDS by up to 4.0%, with almost no impact on camera detection inference time.
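
The text above does not spell out the internals of the LiDAR-assisted prompter, so the following is only a minimal NumPy sketch of the general idea, under stated assumptions: multi-scale LiDAR BEV features (already resampled to a common grid) are combined with softmax weights, added to the camera BEV feature as a prompt, and the camera feature is then pulled toward the fused feature with a mean-squared imitation loss. The function names `aggregate_scales` and `imitation_loss`, the additive fusion, the tensor shapes, and the loss form are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def aggregate_scales(features, scale_logits):
    """Weighted sum of same-shape BEV feature maps.

    Hypothetical stand-in for a hierarchical aggregation step:
    one scalar logit per scale, turned into weights by softmax.
    """
    weights = softmax(np.asarray(scale_logits, dtype=np.float64))
    stacked = np.stack(features, axis=0)            # (S, C, H, W)
    return np.tensordot(weights, stacked, axes=1)   # (C, H, W)


def imitation_loss(camera_bev, fused_bev):
    """Mean-squared error between the camera feature and the fused target.

    Assumed form of a feature-imitation (distillation) loss; the fused
    feature is treated as a fixed target.
    """
    return float(np.mean((camera_bev - fused_bev) ** 2))


rng = np.random.default_rng(0)
C, H, W = 4, 8, 8                                   # toy BEV grid, assumed sizes
lidar_scales = [rng.normal(size=(C, H, W)) for _ in range(3)]
camera_bev = rng.normal(size=(C, H, W))

# Equal logits give equal weights, i.e. a plain average over scales.
lidar_bev = aggregate_scales(lidar_scales, scale_logits=[0.0, 0.0, 0.0])

# Assumed additive prompt: the LiDAR feature is added to the camera feature.
fused_bev = camera_bev + lidar_bev

loss = imitation_loss(camera_bev, fused_bev)
```

In a design of this kind, the fused feature would feed the detection head when LiDAR is available, while the imitation term is used only during training, which is consistent with the claim that camera-only inference time is almost unaffected.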
