
RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

Ziying Song, Guoxing Zhang, Lin Liu, Lei Yang, Shaoqing Xu, Caiyan Jia, Feiyang Jia, Li Wang

TL;DR

RoboFusion tackles the vulnerability of multi-modal 3D object detectors to OOD noise in autonomous driving by integrating Visual Foundation Models (VFMs) with LiDAR-camera fusion. It introduces SAM-AD for AD-specialized image features, an AD-FPN for multi-scale fusion, a Depth-Guided Wavelet Attention (DGWA) to denoise depth-guided images, and Adaptive Fusion with self-attention to reweight multimodal features. Empirically, RoboFusion achieves state-of-the-art performance on clean KITTI/nuScenes and demonstrates superior robustness on KITTI-C and nuScenes-C under diverse weather and sensor corruptions, including substantial gains in weather-related noise (e.g., $AP_{Weather}$ improvements). The work highlights the practical impact of incorporating VFMs into AD perception, offering a foundation for robust, real-world deployment, while acknowledging trade-offs in speed for larger VFMs and pointing to future work on speed-optimized training-only guidance.

Abstract

Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although they achieve state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. The emergence of visual foundation models (VFMs) presents both opportunities and challenges for improving the robustness and generalization of multi-modal 3D object detection in AD. Therefore, we propose RoboFusion, a robust framework that leverages VFMs like SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM to AD scenarios, yielding SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images, further reducing noise and weather interference. Finally, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.
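
The wavelet-denoising idea in the abstract can be illustrated with a minimal sketch. This is not the authors' DGWA module: it applies a single-level 2D Haar decomposition to a feature map, soft-thresholds the three detail sub-bands, and reconstructs the map. The Haar basis, the soft-threshold rule, the threshold value, and all function names are assumptions made for illustration.

```python
import numpy as np

def haar_dwt2(a):
    """Single-level orthonormal 2D Haar transform of an (H, W) array with even H, W."""
    a00, a01 = a[0::2, 0::2], a[0::2, 1::2]
    a10, a11 = a[1::2, 0::2], a[1::2, 1::2]
    ll = (a00 + a01 + a10 + a11) / 2.0  # low-pass approximation
    lh = (a00 + a01 - a10 - a11) / 2.0  # detail along rows
    hl = (a00 - a01 + a10 - a11) / 2.0  # detail along columns
    hh = (a00 - a01 - a10 + a11) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    out = np.empty((2 * ll.shape[0], 2 * ll.shape[1]), dtype=np.float64)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

def soft_threshold(c, t):
    """Shrink coefficients toward zero by t (illustrative choice of shrinkage rule)."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def wavelet_denoise(a, t):
    """Keep the approximation band, shrink the detail bands, then reconstruct."""
    ll, lh, hl, hh = haar_dwt2(np.asarray(a, dtype=np.float64))
    return haar_idwt2(ll, soft_threshold(lh, t), soft_threshold(hl, t), soft_threshold(hh, t))

# Toy usage: a constant map corrupted by Gaussian noise (synthetic data).
rng = np.random.default_rng(0)
clean_map = np.full((8, 8), 5.0)
noisy_map = clean_map + rng.normal(0.0, 1.0, size=clean_map.shape)
denoised_map = wavelet_denoise(noisy_map, t=1.0)
print("RMS error before:", np.sqrt(np.mean((noisy_map - clean_map) ** 2)))
print("RMS error after: ", np.sqrt(np.mean((denoised_map - clean_map) ** 2)))
```

With a threshold of zero the transform pair reproduces its input exactly, which is a convenient sanity check before tuning the threshold.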

Paper Structure (35 sections, 3 equations, 5 figures, 15 tables)

This paper contains 35 sections, 3 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: (a) We use Gaussian distributions to represent the distributional disparities among the datasets; there is a large gap in data distribution between an OOD noise validation set and a clean validation set. The X-axis represents the set of mean pixel values in a dataset, $X = \{x_{n} \,|\, n=1,2,...,N\}$, with $x_{n} = \frac{1}{H \times W \times 3} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{3} I^{(n)}_{ijk}$, where $N$ is the number of images in the dataset, $H$ is the image height, $W$ is the image width, and $I^{(n)}_{ijk}$ denotes the pixel values of the $n$-th image. (b) Visual foundation models (VFMs) such as SAM show robust performance in many noisy scenarios, yet current methods are not robust enough for 3D prediction tasks in autonomous driving perception. (c) To this end, we propose a robust framework, RoboFusion, which applies VFMs to SOTA multi-modal 3D object detection. Empirical results show that our method surpasses the top-performing LoGoNet on the KITTI leaderboard by a margin of 23.12% mAP (Weather) in KITTI-C noisy scenarios. Notably, RoboFusion also compares favorably with LoGoNet on the clean KITTI dataset.
  • Figure 2: The framework of RoboFusion. The LiDAR branch follows the baselines (Focals Conv and TransFusion) to generate LiDAR features. In the camera branch, we first extract robust image features using a highly optimized SAM-AD and obtain multi-scale features using AD-FPN. Second, a sparse depth map $S$ is generated from the raw points and fed into a depth encoder to obtain depth features, which are fused with the multi-scale image features $F_i$ to obtain depth-guided image features $\hat{F}_i$. Wavelet attention is then used to remove mutation noise. Finally, Adaptive Fusion integrates the point cloud features with the robust, depth-aware image features via a self-attention mechanism.
  • Figure 3: An illustration of the pre-training framework. We corrupt a clean image $x$ with $\eta$, which contains multiple weather noises, and then randomly mask several patches of the noisy image $x + \eta$ to obtain a masked noisy image $Mask(x+\eta)$. SAM-AD and the DMAE decoder are trained to reconstruct the clean image $\hat{x}$ from $Mask(x+\eta)$.
  • Figure 4: The architecture of Adaptive Fusion, which involves adaptively re-weighting the fused features using self-attention.
  • Figure 5: Visualization results of LoGoNet and our RoboFusion on the KITTI-C dataset. Red boxes represent false positives, green boxes true positives, and black boxes the ground truth. Blue dashed ovals highlight the pronounced improvements in predictions.
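
The per-image statistic described in the Figure 1 caption (the mean pixel value over height, width, and the three channels) can be sketched as follows. This is a minimal illustration on synthetic data; the function names and the brightness shift used as a stand-in for a corruption are assumptions, not the KITTI-C noise models.

```python
import numpy as np

def mean_pixel_values(images):
    """Per-image mean over H, W and 3 channels for a stack of shape (N, H, W, 3)."""
    images = np.asarray(images, dtype=np.float64)
    if images.ndim != 4 or images.shape[-1] != 3:
        raise ValueError("expected an array of shape (N, H, W, 3)")
    return images.mean(axis=(1, 2, 3))

def gaussian_summary(x):
    """Mean and standard deviation of the statistic, i.e. the fitted Gaussian's parameters."""
    return float(np.mean(x)), float(np.std(x))

# Synthetic stand-ins for a clean and a corrupted validation set.
rng = np.random.default_rng(0)
clean_set = rng.integers(0, 200, size=(16, 4, 6, 3))
shifted_set = np.clip(clean_set + 40, 0, 255)  # hypothetical corruption: uniform brightening

clean_mu, clean_sigma = gaussian_summary(mean_pixel_values(clean_set))
shift_mu, shift_sigma = gaussian_summary(mean_pixel_values(shifted_set))
print("clean:  ", clean_mu, clean_sigma)
print("shifted:", shift_mu, shift_sigma)
```

Comparing the two fitted means and standard deviations gives a rough numeric view of the distribution gap that the figure shows graphically.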
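
The sparse depth map $S$ mentioned in the Figure 2 caption is generated from the raw points. One common way to build such a map, shown here as a sketch under stated assumptions, is to project points already expressed in the camera frame through a pinhole intrinsic matrix and keep the nearest depth per pixel. The pinhole model, the matrix `K`, and the function name are assumptions; the paper's exact projection pipeline is not given in this overview.

```python
import numpy as np

def sparse_depth_map(points_cam, K, height, width):
    """Project (M, 3) camera-frame points to an (height, width) depth map.

    Pixels that receive no point are left at 0; if several points hit the
    same pixel, the smallest depth (nearest point) is kept.
    """
    pts = np.asarray(points_cam, dtype=np.float64)
    pts = pts[pts[:, 2] > 0]                     # drop points behind the camera
    uvw = pts @ np.asarray(K, dtype=np.float64).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)  # column index
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)  # row index
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v[inside], u[inside]), pts[inside, 2])
    depth[np.isinf(depth)] = 0.0                 # mark empty pixels
    return depth

# Toy usage with a hypothetical intrinsic matrix (unit focal length, principal point (2, 2)).
K_demo = np.array([[1.0, 0.0, 2.0],
                   [0.0, 1.0, 2.0],
                   [0.0, 0.0, 1.0]])
points_demo = np.array([[0.0, 0.0, 5.0],     # lands on pixel (row 2, col 2)
                        [0.0, 0.0, 3.0],     # same pixel, nearer
                        [0.0, 0.0, -1.0],    # behind the camera
                        [100.0, 0.0, 1.0]])  # outside the image
depth_demo = sparse_depth_map(points_demo, K_demo, height=5, width=5)
print(depth_demo)
```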
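
The input construction in the Figure 3 caption, corrupting a clean image $x$ with $\eta$ and then masking random patches to obtain $Mask(x+\eta)$, can be sketched as below. The patch size, mask ratio, zero-fill for masked patches, and the mean-squared-error loss are illustrative assumptions; the overview does not state the values or the loss used in the paper.

```python
import numpy as np

def mask_noisy_image(x, eta, patch=4, mask_ratio=0.75, rng=None):
    """Return Mask(x + eta) and the boolean pixel mask (True = masked)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape[:2]
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    noisy = np.asarray(x, dtype=np.float64) + eta
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))
    masked_ids = rng.choice(n_patches, size=n_masked, replace=False)
    patch_mask = np.zeros(n_patches, dtype=bool)
    patch_mask[masked_ids] = True
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(patch_mask.reshape(gh, gw), patch, axis=0), patch, axis=1)
    masked = noisy.copy()
    masked[pixel_mask] = 0.0                     # assumed fill value for masked patches
    return masked, pixel_mask

def reconstruction_loss(x_hat, x):
    """Mean squared error against the clean image (assumed loss for illustration)."""
    return float(np.mean((np.asarray(x_hat, dtype=np.float64) - x) ** 2))

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
x_demo = np.zeros((8, 8, 3))
eta_demo = np.ones((8, 8, 3))                    # stand-in for weather noise
masked_demo, mask_demo = mask_noisy_image(x_demo, eta_demo, patch=4, mask_ratio=0.5, rng=rng)
print("masked pixels:", int(mask_demo.sum()))
```

Note that the reconstruction target is the clean image rather than the noisy one, which is what makes the pre-training a denoising task.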
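
The self-attention re-weighting named in the Figure 4 caption can be illustrated with a generic single-head, scaled dot-product attention over concatenated LiDAR and image features. This is only a sketch of the general mechanism: the concatenation step, the single head, the token layout, and the random projection matrices (which would be learned in a real model) are all assumptions, and the actual Adaptive Fusion architecture may differ.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_reweight(fused, w_q, w_k, w_v):
    """Scaled dot-product self-attention over T tokens of shape (T, C)."""
    q, k, v = fused @ w_q, fused @ w_k, fused @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (T, T) weights
    return attn @ v, attn

def adaptive_fusion_sketch(lidar_feat, image_feat, w_q, w_k, w_v):
    """Concatenate per-token LiDAR and image features, then re-weight them."""
    fused = np.concatenate([lidar_feat, image_feat], axis=-1)
    return self_attention_reweight(fused, w_q, w_k, w_v)

# Toy usage: 6 tokens, 4 LiDAR channels, 4 image channels, 8 output channels.
rng = np.random.default_rng(0)
lidar_demo = rng.normal(size=(6, 4))
image_demo = rng.normal(size=(6, 4))
w_q_demo, w_k_demo, w_v_demo = (rng.normal(size=(8, 8)) for _ in range(3))  # random stand-ins for learned weights
out_demo, attn_demo = adaptive_fusion_sketch(lidar_demo, image_demo, w_q_demo, w_k_demo, w_v_demo)
print(out_demo.shape, attn_demo.shape)
```

Each row of the attention matrix sums to one, so every output token is a convex combination of the value vectors, which is the sense in which the fused features are re-weighted.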