Fusion is Not Enough: Single Modal Attacks on Fusion Models for 3D Object Detection

Zhiyuan Cheng; Hongjun Choi; James Liang; Shiwei Feng; Guanhong Tao; Dongfang Liu; Michael Zuzak; Xiangyu Zhang

TL;DR

The paper examines the security of multi-sensor fusion (MSF) for 3D object detection and shows that camera-only perturbations can subvert camera-LiDAR fusion models. It introduces a two-stage, patch-based attack framework. The first stage learns a sensitivity heatmap over image regions by jointly optimizing a patch $p$ and a mask $M$ to minimize $L_{adv}$ while controlling $L_{mask}$. The second stage applies a scene-oriented or an object-oriented attack, depending on whether the model is globally sensitive or object-sensitive, and uses projection and Expectation over Transformation (EoT) for physical realism. Across six fusion models and one camera-only model on nuScenes, the approach substantially degrades detection performance (e.g., mAP drops from $0.824$ to $0.353$, and the detection score of a target object drops from $0.728$ to $0.156$), and its practicality is validated in simulation and real-world-like settings. The work also offers defense-oriented insights, including the influence of image backbones on sensitivity and the potential for architectural adjustments to improve robustness. Overall, the study raises important considerations for MSF security in autonomous perception and provides a deployable two-stage framework for evaluating and enhancing robustness against camera-only threats.
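To make the first stage concrete, the following is a minimal sketch of jointly optimizing a patch $p$ and a mask $M$ so that $L_{adv} + \lambda L_{mask}$ decreases. Everything beyond the symbols $p$, $M$, $L_{adv}$ and $L_{mask}$ is an assumption made for illustration: `toy_score` is a hypothetical stand-in for a detector (a single sigmoid over a fixed weight map), the blending rule $x' = x \odot (1 - M) + p \odot M$, the choice $L_{mask} = \mathrm{mean}(M)$, the weight `lam`, and the plain gradient-descent loop are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def toy_score(image, weights, bias):
    """Hypothetical stand-in for a detector: one confidence score per image."""
    return sigmoid(np.sum(weights * image) + bias)

def learn_sensitivity(image, weights, bias, lam=1.0, lr=1.0, steps=500):
    """Jointly optimize a patch and a soft mask against the toy detector.

    Assumed losses: L_adv is the toy detection score, L_mask is the mean
    mask value, and the total objective is L_adv + lam * L_mask.
    """
    patch = np.full_like(image, 0.5)      # patch p, same size as the image
    mask_logits = np.zeros_like(image)    # M = sigmoid(logits), starts at 0.5
    for _ in range(steps):
        mask = sigmoid(mask_logits)
        perturbed = image * (1.0 - mask) + patch * mask
        score = toy_score(perturbed, weights, bias)            # L_adv
        d_perturbed = score * (1.0 - score) * weights          # dL_adv / dx'
        grad_patch = d_perturbed * mask                        # dL / dp
        grad_mask = d_perturbed * (patch - image) + lam / image.size
        grad_logits = grad_mask * mask * (1.0 - mask)          # chain rule through sigmoid
        patch = np.clip(patch - lr * grad_patch, 0.0, 1.0)     # keep valid pixel range
        mask_logits = mask_logits - lr * grad_logits
    return patch, sigmoid(mask_logits)

# Toy scene: only a 2x2 region influences the hypothetical detector.
image = np.full((8, 8), 0.8)
weights = np.zeros((8, 8))
weights[2:4, 2:4] = 1.0
bias = -1.0

patch, mask = learn_sensitivity(image, weights, bias)
adversarial = image * (1.0 - mask) + patch * mask
print("clean score:", toy_score(image, weights, bias))
print("attacked score:", toy_score(adversarial, weights, bias))
print("mean mask, sensitive region:", mask[2:4, 2:4].mean())
print("mean mask, elsewhere:", mask[5:, 5:].mean())
```

In this toy setting the learned mask plays the role of the sensitivity heatmap: the mask penalty shrinks it wherever perturbation does not lower the score, so the remaining high-mask pixels mark the regions the model is sensitive to. A real fusion model would replace `toy_score`, and the gradients would come from automatic differentiation rather than the hand-derived expressions used here.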

Abstract

Multi-sensor fusion (MSF) is widely used in autonomous vehicles (AVs) for perception, particularly for 3D object detection with camera and LiDAR sensors. The purpose of fusion is to capitalize on the advantages of each modality while minimizing its weaknesses. Advanced deep neural network (DNN)-based fusion techniques have demonstrated exceptional, industry-leading performance. Because multiple modalities carry redundant information, MSF is also recognized as a general defense strategy against adversarial attacks. In this paper, we attack fusion models through the camera modality, which is considered to be of lesser importance in fusion but is more affordable for attackers. We argue that the weakest link of fusion models depends on their most vulnerable modality, and we propose an attack framework that targets advanced camera-LiDAR fusion-based 3D object detection models through camera-only adversarial attacks. Our approach employs a two-stage optimization-based strategy that first thoroughly evaluates vulnerable image areas under adversarial attacks, and then applies dedicated attack strategies for different fusion models to generate deployable patches. Evaluations with six advanced camera-LiDAR fusion models and one camera-only model indicate that our attacks successfully compromise all of them. Our approach can either decrease the mean average precision (mAP) of detection performance from 0.824 to 0.353, or degrade the detection score of a target object from 0.728 to 0.156, demonstrating the efficacy of our proposed attack framework. Code is available.

Paper Structure (29 sections, 8 equations, 22 figures, 9 tables)

Introduction
Related Work
Motivation
Method
Evaluation
Sensitivity Distribution Recognition
Scene-oriented Attacks
Object-oriented Attacks
Practicality
Conclusion
Ethics Statement
Acknowledgements
Varying Significance of Different Modalities in Fusion
Discussion of other fusion strategies
General Architecture of Camera-LiDAR Fusion
...and 14 more sections

Figures (22)

Figure 1: Single-modal attacks against camera-LiDAR fusion model using camera-modality.
Figure 2: Motivating example of adversarial patch attack on images against fusion models.
Figure 3: Framework of single-modal attacks against camera-LiDAR fusion model with adversarial patches.
Figure 4: Projections in different attack strategies.
Figure 5: Sensitivity heatmaps of six camera-LiDAR fusion models and a camera-only model on two scenes.
...and 17 more figures
