DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection

Feiyang Jia, Caiyan Jia, Ailin Liu, Shaoqing Xu, Qiming Xia, Lin Liu, Lei Yang, Yan Gong, Ziying Song

TL;DR

DGFusion tackles hard instance detection in multi-modal 3D object detection by introducing a Dual-guided paradigm that unifies Point-guide-Image and Image-guide-Point approaches. It builds instance-level features via IFG, then uses DIPM to create easy and hard instance pairs, enabling two complementary fusion paths through PGIE and IGPE before final detection. The approach yields consistent gains on nuScenes (e.g., +1.0% mAP, +0.8% NDS on the test set) and demonstrates robustness across distance, visibility, and small object sizes, with competitive latency compared to strong baselines. This framework provides a scalable, geometry-aware fusion strategy that improves reliability in challenging autonomous driving scenarios and can extend to other unified BEV-based detectors.

Abstract

As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.

Paper Structure

This paper contains 30 sections, 4 equations, 6 figures, 9 tables, 2 algorithms.

Figures (6)

  • Figure 1: (a) The information density gap is a distinctive characteristic of certain distant, occluded, or small-scale targets. It manifests as either sparse point cloud data but rich pixel information (red dashed circle) or the converse (green dashed circle). Most existing research focuses on only one of these cases. (b) LiDAR point counts for all annotations in the nuScenes training and validation sets, grouped by visibility token, to show how general the two phenomena above are. Notably, even among objects with the highest visibility (token=4), over 20% have zero or only one LiDAR point. Conversely, a significant portion of objects with the lowest visibility (token=1) still retain rich point cloud data. Statement: 1) Picture from nuScenes, sample token: a771effa2a2648d78096c3e92b95b129; visualization and data statistics were implemented via the nuScenes DevKit Python SDK. 2) For the key frames of the nuScenes LiDAR point clouds, the number of points falling within the bounding box of each ground-truth (GT) annotation is recorded under the attribute name 'num_lidar_pts', which is the value we count. 3) The visibility token, an attribute within the nuScenes annotations, quantifies the visibility level of targets in camera data, categorized as follows: 1 (0%–40%), 2 (40%–60%), 3 (60%–80%), and 4 (80%–100%).
  • Figure 2: The paradigms of multi-modal 3D object detection methods in autonomous driving and the performance of our new paradigm. (a) The Image-guide-Point paradigm obtains 2D feature information through human-designed elements to guide 3D feature information. (b) The Point-guide-Image paradigm acquires point cloud-dominated instance-level features in a unified BEV space to transfer semantic and geometric information. (c) The Dual-guided paradigm we propose can sensitively capture the information density gap between different modalities. (d) DGFusion, designed based on the Dual-guided paradigm, demonstrates exceptional robustness without requiring additional training epochs. Specifically, DGFusion's inference results on objects of varying distance (top left), visibility (top right), and size (bottom left) validate its effectiveness in mitigating hard instance detection challenges. Furthermore, all DGFusion models trained on the nuScenes small-scale training dataset consistently exhibit stable and superior performance on the validation set (bottom right).
  • Figure 3: DGFusion Framework. (a) First, we extract BEV features by structured integration of LiDAR and camera data. (b) The Instance Match Modules contain: (i) the Instance-level Features Generator (IFG), which produces multi-modal instances, and (ii) the Difficulty-aware Instance Pair Matcher (DIPM), which matches Easy Instance Pairs (EIP) and two types of Hard Instance Pairs (C-HIP and L-HIP). (c) The Dual-guided Modules then perform: (i) Point-guide-Image Enhancement (PGIE) to enhance Camera BEV space using EIP and C-HIP, and (ii) Image-guide-Point Enhancement (IGPE) to enhance Camera BEV space using L-HIP. (d) Finally, we concatenate the enhanced BEV features and generate 3D detection results with a dense detection head.
  • Figure 4: Generating instance-level features through sampling. Proposals are generated directly from the BEV feature maps by an additional prediction head and projected onto the 2D space; the center point of each proposal and the midpoints of its boundary lines are then acquired, and the features at these five key sample points are concatenated to form enriched instance features.
  • Figure 5: Pipeline of Difficulty-aware Instance Pair Matcher. The DIPM operates in two sequential stages. Stage 1 selects cross-modally consistent easy instance pairs (EIPs) through LiDAR-dominated IoU matching, while Stage 2 constructs both camera-hard instance pairs (C-HIPs) and LiDAR-hard instance pairs (L-HIPs) by computing intra-modality feature similarities.
  • ...and 1 more figures
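
The five-point sampling described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the projected proposal is an axis-aligned 2D box (x1, y1, x2, y2) with four boundary lines, and it uses nearest-neighbor lookup on a plain nested-list feature map, whereas the paper may use a different interpolation. The names five_key_points and sample_instance_feature are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed axis-aligned


def five_key_points(box: Box) -> List[Point]:
    """Return the box center followed by the midpoints of its four edges."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return [(cx, cy), (cx, y1), (cx, y2), (x1, cy), (x2, cy)]


def sample_instance_feature(
    feature_map: Sequence[Sequence[Sequence[float]]], box: Box
) -> List[float]:
    """Concatenate the feature vectors found at the five key points.

    feature_map is indexed as feature_map[row][col] -> feature vector.
    Nearest-neighbor lookup with clamping is an assumption of this sketch.
    """
    height, width = len(feature_map), len(feature_map[0])
    instance_feature: List[float] = []
    for x, y in five_key_points(box):
        # Round to the nearest cell and clamp to the map bounds.
        col = min(max(math.floor(x + 0.5), 0), width - 1)
        row = min(max(math.floor(y + 0.5), 0), height - 1)
        instance_feature.extend(feature_map[row][col])
    return instance_feature


if __name__ == "__main__":
    # Toy 4x4 map whose feature at each cell is simply [row, col].
    toy_map = [[[r, c] for c in range(4)] for r in range(4)]
    print(sample_instance_feature(toy_map, (0, 0, 2, 2)))
```

With a C-channel feature map, each instance feature produced this way has 5 × C entries, one C-dimensional slice per key point.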