Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong

Abstract

Integrating LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation is crucial for enhancing the 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors stemming from calibration inaccuracies and the rolling shutter effect. The key insight of this work is that the locations of these projection errors are not random but highly predictable: they are concentrated at object-background boundaries, which 2D detectors can reliably identify. Based on this, our main motivation is to use 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to alleviate misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF), which suppresses residual noise from PGDC and explicitly enhances sharp depth transitions at object-background boundaries, yielding a structurally aware representation. To make effective use of these aligned representations, we incorporate the Structural Guidance Depth Modulator (SGDM), which uses a gated attention mechanism to efficiently fuse aligned depth and image features. Our method achieves state-of-the-art performance on the nuScenes validation set, reaching 71.5% mAP and 73.6% NDS. On the Argoverse 2 validation set, it achieves a competitive mAP of 41.7%.

Paper Structure

This paper contains 13 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: LiDAR points from the distant wall (cyan) are incorrectly projected onto the foreground vehicle (which should appear red) due to a sharp depth change. In contrast, the boundary between background objects such as the wall (light cyan) and the garage (dark cyan) is correctly projected thanks to a gradual depth transition. This shows that misalignment is most severe at abrupt foreground-background boundaries. Our motivation is therefore to rectify misaligned points while preserving points that are already aligned.
  • Figure 2: Overview of our proposed framework. (a) Prior Guided Depth Calibration (PGDC) and (b) Discontinuity Aware Geometric Fusion (DAGF) proactively mitigate multi-sensor feature misalignment before view transformation. (c) Structural Guidance Depth Modulator (SGDM) fuses image features with the dense geometric representation to predict an accurate depth distribution. Finally, fusing the rectified camera BEV features with LiDAR BEV features leads to robust 3D detection.
  • Figure 3: The Prior Guided Depth Calibration (PGDC) module uses 2D detection boxes as priors to target and correct the most severe feature misalignments, which are caused by calibration errors and motion distortion. By applying localized smoothing to the point cloud within these detected regions, the module corrects the erroneous depth information. At the same time, it enhances the image features in these critical areas.
  • Figure 4: (a) The original projected depth. (b) The projected depth after applying Discrepancy Masking. (c) The block-wise depth map after Block-based Densification. (d), (e), and (f) The final depth change magnitude maps of different methods after Block-based Gradient Extraction. (g), (h), and (i) Visualizations of detection results, in which green boxes are True Positives (TP), solid red boxes are False Positives (FP), and dashed red boxes are False Negatives (FN). Our method (f) outperforms GraphBEV (e) and BEVFusion (d) because it accurately delineates regions with drastic depth variations and avoids over-smoothing.
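
The Figure 4 caption names two steps, Block-based Densification and Block-based Gradient Extraction, without giving their definitions here. The sketch below is one plausible reading, not the authors' implementation. It assumes a sparse depth map where 0 means "no LiDAR return", a hypothetical block size of 8 pixels, the minimum valid depth as the per-block statistic, and the largest absolute difference to the four neighbouring blocks as the "depth change magnitude". The function names `block_densify` and `block_gradient` are illustrative, and Discrepancy Masking is omitted.

```python
import numpy as np


def block_densify(sparse_depth, block=8):
    """Reduce a sparse depth map (0 = no LiDAR return) to one depth per block.

    Assumption: each block takes the minimum valid depth inside it.
    The paper may use a different statistic.
    """
    h, w = sparse_depth.shape
    hb, wb = h // block, w // block
    # Crop so the map tiles evenly, then group pixels by block.
    tiles = sparse_depth[:hb * block, :wb * block].reshape(hb, block, wb, block)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(hb, wb, block * block)
    valid = tiles > 0
    # Empty pixels become +inf so they never win the minimum.
    block_depth = np.where(valid, tiles, np.inf).min(axis=-1)
    # Blocks without any LiDAR point stay empty (0).
    block_depth[~valid.any(axis=-1)] = 0.0
    return block_depth


def block_gradient(block_depth):
    """Largest absolute depth difference to the 4 neighbouring blocks.

    Differences that involve an empty block (depth 0) are ignored.
    """
    grad = np.zeros(block_depth.shape, dtype=np.float64)
    valid = block_depth > 0

    # Horizontal neighbours: compare each block with the one to its right.
    dh = np.abs(block_depth[:, 1:] - block_depth[:, :-1])
    dh = np.where(valid[:, 1:] & valid[:, :-1], dh, 0.0)
    grad[:, 1:] = np.maximum(grad[:, 1:], dh)
    grad[:, :-1] = np.maximum(grad[:, :-1], dh)

    # Vertical neighbours: compare each block with the one below it.
    dv = np.abs(block_depth[1:, :] - block_depth[:-1, :])
    dv = np.where(valid[1:, :] & valid[:-1, :], dv, 0.0)
    grad[1:, :] = np.maximum(grad[1:, :], dv)
    grad[:-1, :] = np.maximum(grad[:-1, :], dv)
    return grad


if __name__ == "__main__":
    # Toy 16x16 depth map: a near object (about 5 m) next to a far wall (30 m).
    depth = np.zeros((16, 16))
    depth[2, 3] = 5.0     # falls in block (0, 0)
    depth[4, 12] = 30.0   # falls in block (0, 1)
    depth[10, 2] = 5.5    # falls in block (1, 0)
    blocks = block_densify(depth, block=8)
    print(blocks)
    print(block_gradient(blocks))
```

In the toy input, the two blocks that straddle the 5 m to 30 m jump receive the largest magnitude, while the pair of blocks at similar depth receives a small one. This mirrors the caption's point that sharp foreground-background transitions are the regions to highlight.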