On the Adversarial Robustness of Camera-based 3D Object Detection

Shaoyuan Xie; Zichao Li; Zeyu Wang; Cihang Xie

TL;DR

Camera-based 3D object detectors face substantial adversarial risk. This work systematically evaluates monocular and bird's-eye-view (BEV) detectors in white-box and black-box settings, using pixel-based and patch-based attacks that target classification and localization. By adapting established 2D attacks and introducing universal patches, the study shows that BEV representations do not guarantee robust classification but can improve robustness to localization attacks, and that explicit depth supervision and temporal fusion significantly influence robustness. Depth-estimation-free models can be more robust in some scenarios, accurate depth estimation improves the robustness of depth-estimation-based methods, and multi-frame benign inputs can mitigate attacks. The results underscore the need for security-aware design in safety-critical deployments such as autonomous driving and point to concrete strategies (depth supervision, temporal fusion, and prudent model scaling) for improving the robustness of camera-based 3D detection.

Abstract

In recent years, camera-based 3D object detection has gained widespread attention for its ability to achieve high performance with low computational cost. However, the robustness of these methods to adversarial attacks has not been thoroughly examined, especially when considering their deployment in safety-critical domains like autonomous driving. In this study, we conduct the first comprehensive investigation of the robustness of leading camera-based 3D object detection approaches under various adversarial conditions. We systematically analyze the resilience of these models under two attack settings, white-box and black-box, and for two primary attack objectives: classification and localization. Additionally, we examine two types of adversarial attack techniques: pixel-based and patch-based. Our experiments yield four interesting findings: (a) bird's-eye-view-based representations exhibit stronger robustness against localization attacks; (b) depth-estimation-free approaches have the potential to show stronger robustness; (c) accurate depth estimation effectively improves robustness for depth-estimation-based methods; (d) incorporating multi-frame benign inputs can effectively mitigate adversarial attacks. We hope our findings can steer the development of future camera-based object detection models with enhanced adversarial robustness.
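To make the notion of a pixel-based attack concrete, the following is a minimal sketch of an L-infinity projected-gradient style update, one common form of the established 2D attacks. This page does not state which attacks or hyperparameters the paper uses, so everything here is an assumption for illustration: the function name `pgd_linf`, the budget `eps`, the step size `alpha`, and the toy linear loss that stands in for a detector's classification or localization loss.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pgd_linf(x, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    """Iteratively perturb pixels x (values in [0, 1]) to increase a loss.

    grad_fn(pixels) must return the gradient of the loss w.r.t. each pixel.
    eps, alpha, and steps are illustrative defaults, not values from the paper.
    """
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        for i, g in enumerate(grad):
            # Take a signed gradient-ascent step on the loss.
            stepped = x_adv[i] + alpha * sign(g)
            # Project back into the eps-ball around the clean pixel.
            stepped = min(max(stepped, x[i] - eps), x[i] + eps)
            # Keep the result a valid pixel intensity.
            x_adv[i] = min(max(stepped, 0.0), 1.0)
    return x_adv


# Toy stand-in for a detector loss: loss = sum(w_i * x_i), so the gradient is w.
weights = [0.5, -1.0, 0.25]


def toy_loss_grad(pixels):
    return weights


clean = [0.2, 0.5, 0.99]
adv = pgd_linf(clean, toy_loss_grad)
```

In a real white-box setting, `grad_fn` would come from backpropagating the detector's loss to the input images; in a black-box setting, the perturbation would be computed on a different source model and then transferred.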

Paper Structure (28 sections, 4 equations, 13 figures, 12 tables)

Introduction
Related Work
Camera-based 3D object detection.
Adversarial attacks on classification.
Adversarial attacks on object detection.
Camera-based 3D Object Detection
Monocular Approach
BEV Detector with Depth Estimation
BEV Detector without Depth Estimation
Generating Adversarial Examples
Pixel-based Attack
Patch-based Attack
Black-box Attack
Experiments
Experimental Setup
...and 13 more sections

Figures (13)

Figure 1: Adversarial nuScenes Detection Score (NDS) vs. clean nuScenes Detection Score. Models that perform better on standard datasets do not necessarily exhibit better adversarial robustness.
Figure 2: Illustration of adversarial patch size adaptations, wherein the patch size is adjusted proportionally to the target's 2D bounding box dimensions. The left panel depicts a fixed-size patch, while the right panel presents a dynamically scaled patch.
Figure 3: Mean Average Precision (mAP) vs. attack iterations. All models are similarly vulnerable to untargeted classification attacks, while their behavior varies widely under localization attacks, where BEV-based models exhibit better robustness.
Figure 4: Left panel: The horizontal axis corresponds to the targeted model while the vertical axis denotes the source model. Transferability is quantified by the proportional reduction in performance (specifically, mAP) in comparison to a randomized patch pattern of identical size. Right panel: The pipeline of the optimization process for the universal patch.
Figure 5: Comparisons between BEV-based models and non-BEV-based models.
...and 8 more figures
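
Two of the captions above describe simple computations: scaling a patch with the target's 2D bounding box (Figure 2) and measuring transferability as the proportional mAP reduction relative to a random patch of the same size (Figure 4). The sketch below is one plausible reading of those captions, not the authors' implementation. The function names, the `ratio` parameter, the centered placement of the patch, and the example numbers are all hypothetical.

```python
def scaled_patch_box(bbox, ratio=0.3):
    """Return a patch rectangle whose size is proportional to a 2D bounding box.

    bbox is (x1, y1, x2, y2) in pixels. ratio is an assumed scale factor, and
    centering the patch on the box is also an assumption.
    """
    x1, y1, x2, y2 = bbox
    width, height = x2 - x1, y2 - y1
    patch_w, patch_h = ratio * width, ratio * height
    center_x, center_y = x1 + width / 2, y1 + height / 2
    return (
        center_x - patch_w / 2,
        center_y - patch_h / 2,
        center_x + patch_w / 2,
        center_y + patch_h / 2,
    )


def relative_map_drop(map_random_patch, map_adv_patch):
    """Proportional mAP reduction of an adversarial patch vs. a random patch."""
    if map_random_patch <= 0:
        raise ValueError("baseline mAP must be positive")
    return (map_random_patch - map_adv_patch) / map_random_patch


# Illustrative inputs only; these are not results from the paper.
patch = scaled_patch_box((0, 0, 100, 50), ratio=0.5)
drop = relative_map_drop(0.4, 0.3)
```

Under this reading, a larger value of `relative_map_drop` means the patch optimized on the source model hurts the target model more than a random pattern of identical size does.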
