PF3Det: A Prompted Foundation Feature Assisted Visual LiDAR 3D Detector

Kaidong Li, Tianxiao Zhang, Kuan-Chuan Peng, Guanghui Wang

TL;DR

PF3Det tackles data-efficient 3D detection by fusing LiDAR and camera information through a prompted foundation framework. It introduces a foundational branch that extracts cross-modal features with a foundation-model encoder, and a multi-modal soft-prompt adapter that modulates BEV (bird's-eye-view) features to bridge the modality gap. On nuScenes with only ~5% of the training data, PF3Det improves NDS by 1.19% and mAP by 2.42%, supporting its effectiveness and data efficiency. The work also gives guidance on choosing the foundation encoder and prompt configuration to balance performance gains against parameter cost, and points to promising directions for multimodal 3D perception in autonomous systems.

Abstract

3D object detection is crucial for autonomous driving, leveraging both LiDAR point clouds for precise depth information and camera images for rich semantic information. Multi-modal methods that combine both modalities therefore offer more robust detection results. However, efficiently fusing LiDAR points and images remains challenging due to the domain gap between them. In addition, the performance of many models is limited by the amount of high-quality labeled data, which is expensive to create. Recent advances in foundation models, which use large-scale pre-training on different modalities, enable better multi-modal fusion. Combining these with prompt engineering techniques for efficient training, we propose the Prompted Foundational 3D Detector (PF3Det), which integrates foundation model encoders and soft prompts to enhance LiDAR-camera feature fusion. PF3Det achieves state-of-the-art results under limited training data, improving NDS by 1.19% and mAP by 2.42% on the nuScenes dataset, demonstrating its efficiency in 3D detection.

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The illustration of the PF3Det pipeline. The blue components are the proposed modules.
  • Figure 2: The architecture of our proposed PF3Det. The foundational branch is added in parallel with the original image backbone. The multi-modal soft-prompt adapter is inserted at the BEV feature level.
  • Figure 3: Foundational Point Encoder. The point features are upsampled to match original feature dimensions.
  • Figure 4: Multi-level multi-modal soft-prompt adapter. Three levels with four sets of soft prompts are tested. Weights after the first prompts are set to be learnable to better fuse features and prompts.
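
To make the idea of a soft prompt acting on BEV features more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the class name `SoftPromptAdapter`, the additive form, the single scalar `weight`, and all tensor shapes are hypothetical. It also performs no training; `prompt` and `weight` simply stand in for parameters that would be learned.

```python
import numpy as np


class SoftPromptAdapter:
    """Hypothetical additive soft prompt on a BEV feature map (not the paper's code)."""

    def __init__(self, channels, height, width, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a learnable prompt with the shape of one BEV feature map.
        self.prompt = rng.normal(0.0, 0.02, size=(channels, height, width))
        # Stand-in for a learnable scalar weight; zero makes the adapter an identity map.
        self.weight = 0.0

    def __call__(self, bev):
        # bev: (batch, channels, height, width); the prompt broadcasts over the batch axis.
        return bev + self.weight * self.prompt


# Toy BEV features: batch of 2, 4 channels, 8x8 grid (shapes are assumptions).
bev = np.ones((2, 4, 8, 8))
adapter = SoftPromptAdapter(channels=4, height=8, width=8)

out_identity = adapter(bev)   # weight == 0, so features pass through unchanged
adapter.weight = 0.5
out_prompted = adapter(bev)   # features shifted by the weighted prompt
```

Starting the weight at zero is one common design choice for adapters, since the pretrained detector's behaviour is preserved until the prompt is trained; whether PF3Det initializes its weights this way is not stated in the text above.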