On the Adversarial Robustness of Camera-based 3D Object Detection
Shaoyuan Xie, Zichao Li, Zeyu Wang, Cihang Xie
TL;DR
Camera-based 3D object detectors face substantial adversarial risk. This work systematically evaluates monocular and BEV detectors in white-box and black-box settings, using pixel-based and patch-based attacks aimed at classification and localization. By adapting established 2D attacks and introducing universal patches, the study reveals that BEV representations do not guarantee robust classification but can improve localization resilience, while explicit depth supervision and temporal fusion significantly influence robustness. Key findings show that depth-estimation-free models can offer stronger robustness in some scenarios, that precise depth estimation improves the robustness of depth-estimation-based methods, and that multi-frame benign inputs can mitigate attacks. The results underscore the need for security-aware design in safety-critical deployments like autonomous driving and identify concrete strategies, such as depth supervision, temporal fusion, and prudent model scaling, to bolster robustness in camera-based 3D detection.
Abstract
In recent years, camera-based 3D object detection has gained widespread attention for its ability to achieve high performance with low computational cost. However, the robustness of these methods to adversarial attacks has not been thoroughly examined, especially when considering their deployment in safety-critical domains like autonomous driving. In this study, we conduct the first comprehensive investigation of the robustness of leading camera-based 3D object detection approaches under various adversarial conditions. We systematically analyze the resilience of these models under two attack settings, white-box and black-box, and with respect to two primary attack objectives, classification and localization. Additionally, we examine two types of adversarial attack techniques: pixel-based and patch-based. Our experiments yield four interesting findings: (a) bird's-eye-view-based representations exhibit stronger robustness against localization attacks; (b) depth-estimation-free approaches have the potential to show stronger robustness; (c) accurate depth estimation effectively improves robustness for depth-estimation-based methods; (d) incorporating multi-frame benign inputs can effectively mitigate adversarial attacks. We hope our findings can steer the development of future camera-based object detection models with enhanced adversarial robustness.
