SoK: Pitfalls in Evaluating Black-Box Attacks

Fnu Suya, Anshuman Suri, Tingwei Zhang, Jingtao Hong, Yuan Tian, David Evans

TL;DR

This SoK introduces a threat-model taxonomy for black-box attacks on image classifiers, organizing attacks along axes of interactive query access, API feedback granularity, auxiliary data quality/quantity, and pretrained-model availability. By surveying 164 attacks and analyzing them within identical threat spaces, the authors reveal many under-explored threat spaces, demonstrate stronger baselines within the same threat model, and highlight connections to model extraction and inversion. They advocate runtime-aware evaluation to reflect realistic attacker costs and propose a modular codebase to standardize evaluations, enabling more realistic and diverse testing of attacks and defenses. The paper emphasizes careful baseline comparisons, diverse and harder evaluation settings, and closer integration with related areas to better understand the real-world threat landscape.

Abstract

Numerous works study black-box attacks on image classifiers. However, these works make different assumptions about the adversary's knowledge, and the current literature lacks a cohesive organization centered around the threat model. To systematize knowledge in this area, we propose a taxonomy over the threat space spanning the axes of feedback granularity, access to interactive queries, and the quality and quantity of the auxiliary data available to the attacker. Our new taxonomy provides three key insights. 1) Despite extensive literature, numerous under-explored threat spaces exist, which cannot be trivially solved by adapting techniques from well-explored settings. We demonstrate this by establishing a new state-of-the-art in the less-studied setting of access to top-k confidence scores by adapting techniques from the well-explored setting of access to the complete confidence vector, but show how it still falls short of the more restrictive setting that only obtains the prediction label, highlighting the need for more research. 2) Identifying the threat model of different attacks uncovers stronger baselines that challenge prior state-of-the-art claims. We demonstrate this by enhancing an initially weaker baseline (under interactive query access) via surrogate models, effectively overturning claims in the respective paper. 3) Our taxonomy reveals interactions between types of attacker knowledge that connect well to related areas, such as model inversion and extraction attacks. We discuss how advances in other areas can enable potentially stronger black-box attacks. Finally, we emphasize the need for a more realistic assessment of attack success by factoring in local attack runtime. This approach reveals the potential for certain attacks to achieve notably higher success rates and the need to evaluate attacks in diverse and harder settings, highlighting the need for better selection criteria.
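To make the feedback-granularity axis concrete, the following is a minimal sketch of how the same classifier output could be exposed at three levels: the complete confidence vector, the top-k scores, and the prediction label only. The names `Feedback` and `api_response` are hypothetical and chosen for illustration; they are not taken from the paper or its codebase.

```python
from enum import Enum
from typing import List


class Feedback(Enum):
    """Hypothetical labels for three API feedback granularities."""
    FULL_SCORES = "full"    # complete confidence vector
    TOP_K = "top_k"         # only the k highest (label, score) pairs
    HARD_LABEL = "label"    # only the predicted label


def api_response(scores: List[float], mode: Feedback, k: int = 1):
    """Restrict a confidence vector to what the attacker would observe."""
    if mode is Feedback.FULL_SCORES:
        return list(scores)
    # Rank class indices by descending confidence.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if mode is Feedback.TOP_K:
        return [(i, scores[i]) for i in ranked[:k]]
    # HARD_LABEL: the attacker learns the top-1 class index and nothing else.
    return ranked[0]


scores = [0.1, 0.6, 0.3]
print(api_response(scores, Feedback.FULL_SCORES))
print(api_response(scores, Feedback.TOP_K, k=2))
print(api_response(scores, Feedback.HARD_LABEL))
```

Each step down this list removes information the attacker can use to guide queries, which is why an attack designed for the full confidence vector generally has to be adapted before it can run in the top-k or label-only setting.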

Paper Structure

This paper contains 43 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of top-k attacks. Square: top-k is our proposed adaptation of the Square Attack for the top-k setting. NES: top-k is the current state-of-the-art attack. SignFlip [chen2020boosting] is an attack for the more restrictive hard-label setting.
  • Figure 2: ASR (y-axis) for various targeted attacks on DenseNet201 models, varying across iterations (a) and time (b). All attacks on the left are run for 100 iterations, while attacks on the right are run for 30 minutes per batch. ASR at each iteration is computed using adversarial examples at that iteration. ASR at 40 iterations is marked with $\star$ for each attack.
  • Figure 3: Attack success rates (ASR) (y-axis, left) for target and local models, along with loss (y-axis, right) while optimizing the objective locally, varying across time (x-axis), for targeted attacks on DenseNet201 (a) and untargeted attacks on adversarially robust Inception-v3$_\text{adv}$ (b), using SMIMI-FGSM [wang2022enhancing]. ASR values at representative iterations (40 for targeted, 10 for untargeted) are marked with a $\star$ for each of the metrics.
  • Figure 4: ASR (y-axis) for various attacks: targeted attacks for Inception-v3 with perturbation budget $16/255$ ($\ell_\infty$) (a), untargeted attacks for Inception-v3 with reduced perturbation budget $8/255$ (b), and untargeted attacks for adversarially robust model Inc-v3$_{\text{adv}}$ with perturbation budget $16/255$ (c). ASR at each iteration is computed using adversarial examples at that iteration. ASR values at representative iterations (40 for targeted, 10 for untargeted) are marked with $\star$ for each attack.
  • Figure 5: ASR (y-axis) for various attacks varying across time: targeted attacks for VGG19 (a) and Resnet101 (b), and untargeted attacks for IncRes-v2$_{\text{ens}}$ (c). ASR at each iteration is computed using adversarial examples at that iteration. ASR values at representative iterations (40 for targeted, 10 for untargeted) are marked with $\star$ for each attack. Note that although SMIMI-FGSM seems to outperform other attacks in most settings, it is outperformed by VMI-FGSM and VNI-FGSM for the case of IncRes-v2$_{\text{ens}}$ (c), further supporting our argument for evaluation under hard and diverse settings.
  • ...and 1 more figure
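
The captions for Figures 2 to 5 report ASR both per iteration and per unit of runtime. The sketch below illustrates why the two views can rank attacks differently. It assumes a constant per-iteration runtime and uses made-up success records; the function names `asr_at_iteration` and `asr_at_time` are hypothetical and are not taken from the paper's codebase.

```python
from typing import Sequence


def asr_at_iteration(success: Sequence[Sequence[bool]], it: int) -> float:
    """Fraction of images whose adversarial example at iteration `it` succeeds.

    success[i][t] is True if image i's example at iteration t fools the target.
    """
    return sum(1 for row in success if row[it]) / len(success)


def asr_at_time(success: Sequence[Sequence[bool]],
                seconds_per_iter: float, budget_s: float) -> float:
    """ASR at the last iteration that completes within `budget_s` seconds.

    Assumes every iteration takes `seconds_per_iter` seconds (a simplification).
    """
    done = int(budget_s // seconds_per_iter)
    if done == 0:
        return 0.0  # no iteration finished, so no adversarial example exists yet
    last = min(done, len(success[0])) - 1
    return asr_at_iteration(success, last)


# Illustrative (made-up) records for two images.
fast = [[False, False, True, True], [False, True, True, True]]  # 1 s per iteration
slow = [[False, True], [True, True]]                            # 3 s per iteration

# Same iteration count (2 iterations): the slow attack looks stronger.
print(asr_at_iteration(fast, 1), asr_at_iteration(slow, 1))
# Same runtime budget (4 s): the fast attack looks stronger.
print(asr_at_time(fast, 1.0, 4.0), asr_at_time(slow, 3.0, 4.0))
```

In this toy data the attack with expensive iterations wins at a fixed iteration count but loses at a fixed time budget, which is the kind of reversal that runtime-aware evaluation is meant to surface.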