CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection

Yuchen Wu, Kun Wang, Yining Pan, Na Zhao

Abstract

Multi-modal fusion has emerged as a promising paradigm for accurate 3D object detection. However, detection performance degrades substantially when models are deployed in target domains that differ from the training domain. In this work, focusing on dual-branch proposal-level detectors, we identify two factors that limit robust cross-domain generalization: 1) in challenging domains such as rain or nighttime, one modality may undergo severe degradation; 2) the LiDAR branch often dominates the detection process, leading to systematic underutilization of visual cues and vulnerability when point clouds are compromised. To address these challenges, we propose three components. First, Query-Decoupled Loss provides independent supervision for 2D-only, 3D-only, and fused queries, rebalancing gradient flow across modalities. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors through probabilistic fusion of image-predicted and LiDAR-derived depth distributions, improving their spatial initialization. Third, Complementary Cross-Modal Masking applies complementary spatial masks to the image and point cloud, encouraging queries from both modalities to compete within the fused decoder and thereby promoting adaptive fusion. Extensive experiments demonstrate substantial gains over state-of-the-art baselines while preserving source-domain performance. Code and models are publicly available at https://github.com/IMPL-Lab/CCF.
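
The idea of complementary spatial masking mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that both modalities are represented on one shared grid of cells, that cells are masked at random, and that a masked cell is zeroed out. The function names, the `keep_ratio` parameter, and the grid representation are all hypothetical.

```python
import random


def complementary_masks(rows, cols, keep_ratio=0.5, seed=0):
    """Return two boolean grids (image_keep, lidar_keep).

    Assumption: a cell kept for the image is dropped for LiDAR and
    vice versa, so every cell is visible to exactly one modality.
    """
    rng = random.Random(seed)
    image_keep = [[rng.random() < keep_ratio for _ in range(cols)]
                  for _ in range(rows)]
    lidar_keep = [[not kept for kept in row] for row in image_keep]
    return image_keep, lidar_keep


def apply_mask(grid, keep):
    """Zero out every cell of `grid` whose `keep` flag is False."""
    return [[value if kept else 0.0 for value, kept in zip(g_row, k_row)]
            for g_row, k_row in zip(grid, keep)]


# Toy feature grids standing in for image and LiDAR features.
image_grid = [[1.0] * 4 for _ in range(3)]
lidar_grid = [[2.0] * 4 for _ in range(3)]

image_keep, lidar_keep = complementary_masks(rows=3, cols=4)
masked_image = apply_mask(image_grid, image_keep)
masked_lidar = apply_mask(lidar_grid, lidar_keep)
```

Because each cell survives in only one modality, neither branch can cover the whole scene alone, which is the intuition behind encouraging both kinds of queries to contribute.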

Paper Structure (16 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: (a) Qualitative examples show that LiDAR and camera modalities degrade differently under adverse conditions. (b) Quantitative results on a baseline dual-branch detector show that our method substantially improves camera-originated queries and narrows their performance gap to LiDAR-originated queries.
  • Figure 2: Analysis of 2D proposal quality. We compare the 2D mAP@50 of proposals from the 2D detector (Faster R-CNN) against projected 3D boxes from the 3D detector (ISFusion). The results show that native 2D proposals consistently outperform projected 3D proposals across all domains.
  • Figure 3: Overview of CCF. CCF addresses modality imbalance with three components. (a) Query-Decoupled Loss uses three parallel, weight-shared decoder passes (2D-only, 3D-only, and fused) to provide modality-specific supervision while avoiding shortcut learning. (b) LiDAR-Guided Depth Prior adaptively fuses image-predicted and LiDAR-derived depth distributions to improve 2D query initialization. (c) Complementary Cross-Modal Masking applies complementary spatial masking, encouraging balanced competition between camera- and LiDAR-originated queries. Together, these components improve modality balance and robustness under domain shift.
  • Figure 4: Illustration of our LiDAR-Guided Depth Prior. For each 2D proposal, we extract a learned depth distribution from image features ($\mathbf{d}^{2d}$) and a geometric prior from LiDAR points ($\mathbf{d}^{3d}$). A confidence network adaptively predicts a fusion weight ($\lambda$) to combine them into a fused distribution ($\mathbf{d}^{Fused}$), which provides robust depth initialization for the 2D query.
  • Figure 5: Examples of 3D object detections on different data splits. We visualize the 3D bounding boxes of cars, trucks, and pedestrians in orange, magenta, and blue, respectively, in the multi-view images.
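
The depth fusion described in the Figure 4 caption can be sketched as a weighted mixture of two discrete depth distributions. This is a minimal sketch under assumptions that the caption does not state: depth is discretized into bins, the fused distribution is a convex combination in which $\lambda$ weights the image-predicted distribution, and the confidence network is replaced by a fixed scalar. The function names and the bin centers are hypothetical, and the paper's actual fusion rule may differ.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def fuse_depth(d_2d, d_3d, lam):
    """Mix two depth-bin distributions as lam * d_2d + (1 - lam) * d_3d.

    Assumption: lam in [0, 1] weights the image-predicted distribution;
    in the paper this weight is predicted by a confidence network.
    """
    if len(d_2d) != len(d_3d):
        raise ValueError("distributions must share the same depth bins")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * p_img + (1.0 - lam) * p_lidar
            for p_img, p_lidar in zip(d_2d, d_3d)]


def expected_depth(dist, bin_centers):
    """Expected depth under a discrete distribution over bin centers."""
    return sum(p * c for p, c in zip(dist, bin_centers))


# Hypothetical depth bins (metres) and toy distributions for one proposal.
bin_centers = [5.0, 15.0, 25.0, 35.0]
d_2d = normalize([1.0, 2.0, 1.0, 0.0])  # image-predicted
d_3d = normalize([0.0, 1.0, 3.0, 0.0])  # LiDAR-derived

d_fused = fuse_depth(d_2d, d_3d, lam=0.25)
print(expected_depth(d_fused, bin_centers))  # 20.625
```

With a low $\lambda$ the fused estimate stays close to the LiDAR-derived prior, while a high $\lambda$ falls back on the image prediction, which is the behaviour one would want when the point cloud is sparse or corrupted.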