A Black-Box Evaluation Framework for Semantic Robustness in Bird's Eye View Detection

Fu Wang, Yanghao Zhang, Xiangyu Yin, Guangliang Cheng, Zeyu Fu, Xiaowei Huang, Wenjie Ruan

TL;DR

The paper tackles the problem of evaluating worst-case robustness of camera-based BEV detectors under semantic perturbations in a black-box setting. It introduces a distance-based surrogate objective that aligns with BEV box matching and a deterministic global optimizer, SimpleDIRECT, which uses a simplified node-selection strategy to efficiently locate adversarial perturbations. Through extensive experiments on nuScenes across ten BEV models, the framework demonstrates superior ability to reveal vulnerabilities compared to random perturbations and baseline optimizers, with PolarFormer showing the strongest robustness and BEVDet being highly susceptible. A full validation-set case study confirms the framework’s practical utility and highlights how temporal information can influence robustness, underscoring the need for robust BEV designs in real-world autonomous driving systems.

Abstract

Camera-based Bird's Eye View (BEV) perception models receive increasing attention for their crucial role in autonomous driving, a domain where concerns about the robustness and reliability of deep learning have been raised. While only a few works have investigated the effects of randomly generated semantic perturbations, aka natural corruptions, on the multi-view BEV detection task, we develop a black-box robustness evaluation framework that adversarially optimises three common semantic perturbations: geometric transformation, colour shifting, and motion blur, to deceive BEV models, serving as the first approach in this emerging field. To address the challenge posed by optimising the semantic perturbation, we design a smoothed, distance-based surrogate function to replace the mAP metric and introduce SimpleDIRECT, a deterministic optimisation algorithm that utilises observed slopes to guide the optimisation process. By comparing with randomised perturbation and two optimisation baselines, we demonstrate the effectiveness of the proposed framework. Additionally, we provide a benchmark on the semantic robustness of ten recent BEV models. The results reveal that PolarFormer, which emphasises geometric information from multi-view images, exhibits the highest robustness, whereas BEVDet is fully compromised, with its precision reduced to zero.
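Two of the three perturbation families named in the abstract, colour shifting and motion blur, can be sketched as follows. This is a minimal illustration only: the function names, the per-channel additive colour parameterisation scaled by a budget `gamma`, and the horizontal linear blur kernel are assumptions made here, not the paper's exact definitions. The geometric transformation is omitted.

```python
import numpy as np


def colour_shift(image, shift, gamma):
    """Additive per-channel colour shift (assumed parameterisation).

    image: float array in [0, 1] with shape (H, W, 3).
    shift: three values in [-1, 1]; they are scaled by the budget gamma.
    """
    image = np.asarray(image, dtype=np.float64)
    delta = np.clip(np.asarray(shift, dtype=np.float64), -1.0, 1.0) * gamma
    return np.clip(image + delta, 0.0, 1.0)


def motion_blur_kernel(size):
    """Horizontal linear motion-blur kernel of odd size (assumed shape)."""
    kernel = np.zeros((size, size))
    kernel[size // 2, :] = 1.0 / size  # uniform weights along the middle row
    return kernel


def motion_blur(image, size):
    """Filter each channel with the kernel, using edge padding."""
    image = np.asarray(image, dtype=np.float64)
    kernel = motion_blur_kernel(size)
    pad = size // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    height, width = image.shape[:2]
    out = np.zeros_like(image)
    for dy in range(size):
        for dx in range(size):
            weight = kernel[dy, dx]
            if weight != 0.0:
                out += weight * padded[dy:dy + height, dx:dx + width]
    return out
```

In a black-box evaluation, parameters such as `shift` would be the variables handed to the optimiser, while `gamma` and the kernel size bound the perturbation strength.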

Paper Structure

This paper contains 36 sections, 2 theorems, 15 equations, 6 figures, 7 tables, 2 algorithms.

Key Result

Lemma 1

Let $\mathcal{L}_{\max}$ denote the current best query result and $H$ the depth of the partition tree. Given a node set $\vartheta = \bigcup_{h=1}^{H}\vartheta_h$ and a positive tolerance $\epsilon > 0$, for any node $\Theta_p$ at depth $h$ we can define three sets as follows: […] where $\theta_a$ denotes the centre of node $\Theta_a$, and the following inequalities hold: […]

Figures (6)

  • Figure 1: Fig. (a) visualises the space trisection strategy, where the dashed red lines mark where the slopes can be obtained. Fig. (b) demonstrates the redundancy introduced by Eq. \ref{eqn:po_cond1}, which often qualifies notably more nodes than Eq. \ref{eqn:po_cond2}. Fig. (c) illustrates the difference between Eq. \ref{eqn:po_cond1} (green lines) and Eq. \ref{eqn:simple_cond} (dashed red lines).
  • Figure 2: An illustration of the impact of semantic perturbations applied at different strengths on five BEV perception models. The effects are quantified in terms of the number of matches (# Match) and the distance metric defined in Eq. \ref{eqn:final_object_func}. Each figure starts with clean input frames, and the perturbations are then set to $\gamma \in [0.1, 0.4]$ for colour, $\gamma \in [0.04, 0.1]$ for geometry, and kernel sizes $\{5, 7, 9, 11\}$ for motion blur. The shaded area indicates the standard deviation of the models' performance.
  • Figure 3: An illustration of the first three iterations of solving a 2-D maximisation problem.
  • Figure 4: An illustration of the impact of semantic perturbations applied at different strengths on five BEV perception models. The effects are quantified in terms of the number of matches (# Match) and the distance metric defined in Eq. \ref{eqn:final_object_func}. The dashed red line represents the average number of ground truth boxes (# GT) across the sampled frames. The standard deviation of the models' performance is indicated by the shaded area surrounding each line.
  • Figure 5: A showcase of running DIRECT at depth $H \in \{4,6,8,10,12\}$ on four semantic perturbations. The brightness and colour perturbations are applied at intensity $\gamma = 0.3$, both geometric perturbations are upper bounded by $\gamma = 0.1$, and the kernel size of the motion blur perturbation is fixed at 9.
  • ...and 1 more figure
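
The space trisection shown in Figure 1(a) and the iterations in Figure 3 can be sketched as below. This is a hedged illustration: `Node`, `trisect`, and `maximise` are hypothetical names, and the selection rule used here (split the best-valued leaf at every depth) is a simple stand-in, not the paper's SimpleDIRECT condition nor the potentially-optimal test of Lemma 1.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(eq=False)
class Node:
    lower: np.ndarray  # lower corner of the hyper-rectangle
    upper: np.ndarray  # upper corner of the hyper-rectangle
    value: float       # objective value queried at the centre
    depth: int         # number of trisections from the root


def trisect(node, objective):
    """Split a node into three equal parts along its longest side."""
    axis = int(np.argmax(node.upper - node.lower))
    third = (node.upper[axis] - node.lower[axis]) / 3.0
    children = []
    for k in range(3):
        lower, upper = node.lower.copy(), node.upper.copy()
        lower[axis] = node.lower[axis] + k * third
        upper[axis] = node.lower[axis] + (k + 1) * third
        if k == 1:
            value = node.value  # middle child keeps the parent's centre
        else:
            value = objective((lower + upper) / 2.0)  # one new query
        children.append(Node(lower, upper, value, node.depth + 1))
    return children


def maximise(objective, lower, upper, iterations=20):
    """Deterministic search; stand-in rule: split the best leaf per depth."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    leaves = [Node(lower, upper, objective((lower + upper) / 2.0), 0)]
    for _ in range(iterations):
        best_per_depth = {}
        for node in leaves:
            current = best_per_depth.get(node.depth)
            if current is None or node.value > current.value:
                best_per_depth[node.depth] = node
        chosen_ids = {id(node) for node in best_per_depth.values()}
        kept = [node for node in leaves if id(node) not in chosen_ids]
        for node in best_per_depth.values():
            kept.extend(trisect(node, objective))
        leaves = kept
    return max(leaves, key=lambda node: node.value)
```

Because the middle child reuses the parent's centre, each trisection costs only two new queries, which matters when every query is a forward pass through a black-box detector.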

Theorems & Definitions (2)

  • Lemma 1: Potential Optimal nodes [Gablonsky01]
  • Theorem 2: [WangXRH23]