Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis

Jeonghwan Park, Niall McLaughlin, Ihsen Alouani

TL;DR

This work tackles the practical challenge of detecting query-based black-box adversarial attacks by shifting focus from input-space patterns to input-update dynamics. It introduces Delta Similarity (DS), a metric based on the cosine similarity between consecutive input updates during zeroth-order gradient estimation, and grounds its discriminative power in concentration-of-measure phenomena. Building on DS, the authors propose GWAD, a lightweight defense that uses Histogram of DS (HoDS) features and a small neural classifier to detect and classify ongoing attacks, with demonstrated generalization across datasets and robustness to adaptive attacks such as OARS. GWAD$^+$ extends the approach with a Screener that pre-screens benign queries to mitigate irregular-batch attacks, achieving near-perfect detection rates in challenging scenarios. Overall, DS-based update-pattern analysis offers a model- and dataset-agnostic defense that outperforms existing stateful defenses and remains resilient under sophisticated adaptive threats, with practical implications for secure MLaaS deployments.
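
To make the DS and HoDS ideas above concrete, here is a minimal NumPy sketch. It is an illustration built on assumptions, not the authors' implementation: it assumes an update is the difference between two successive queries, that DS is the cosine similarity between two successive updates, and that HoDS is a normalized histogram of DS values over a window. The function names `delta_similarity` and `hods`, the bin count, and the `eps` guard are hypothetical choices made for this example.

```python
import numpy as np


def delta_similarity(queries, eps=1e-12):
    """Cosine similarity between consecutive query updates (assumed DS form).

    queries: array of shape (T, ...) holding T queries in arrival order.
    Returns an array of length max(T - 2, 0).
    """
    q = np.asarray(queries, dtype=np.float64)
    q = q.reshape(q.shape[0], -1)            # flatten each query to a vector
    deltas = np.diff(q, axis=0)              # delta_t = x_t - x_{t-1}
    cur, prev = deltas[1:], deltas[:-1]      # pair each update with the previous one
    dots = np.sum(cur * prev, axis=1)
    norms = np.linalg.norm(cur, axis=1) * np.linalg.norm(prev, axis=1)
    # eps avoids division by zero when a query is repeated (zero update).
    ds = dots / np.maximum(norms, eps)
    return np.clip(ds, -1.0, 1.0)            # guard against rounding outside [-1, 1]


def hods(ds, bins=20):
    """Normalized histogram of DS values over [-1, 1] (assumed HoDS form)."""
    counts, _ = np.histogram(ds, bins=bins, range=(-1.0, 1.0))
    return counts / max(counts.sum(), 1)


if __name__ == "__main__":
    # Toy 2-D "queries": two parallel updates, then an orthogonal one.
    toy = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])
    ds = delta_similarity(toy)               # DS values: 1 (parallel), 0 (orthogonal)
    print(ds)
    print(hods(ds, bins=4))

    # A probe-and-return pattern (x, x + u, x) yields antiparallel updates, DS = -1.
    print(delta_similarity(np.array([[0, 0], [1, 0], [0, 0]])))
```

In a deployment along the lines described above, the HoDS vector computed over a sliding window of queries would be the input to the small classifier; the window length and classifier architecture are not specified in this summary and are left out of the sketch.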

Abstract

Adversarial attacks remain a significant threat that can jeopardize the integrity of Machine Learning (ML) models. In particular, query-based black-box attacks can generate malicious noise without access to the victim model's architecture, making them practical in real-world contexts. The community has proposed several defenses against adversarial attacks, only for them to be broken by more advanced and adaptive attack strategies. In this paper, we propose a framework that detects whether an adversarial noise instance is being generated. Unlike existing stateful defenses, which detect adversarial noise generation by monitoring the input space, our approach learns adversarial patterns in the input update similarity space. Specifically, we propose a new metric called Delta Similarity (DS), which we show captures adversarial behavior more efficiently. We evaluate our approach against 8 state-of-the-art attacks, including adaptive attacks in which the adversary is aware of the defense and tries to evade detection. We find that our approach is significantly more robust than existing defenses in terms of both specificity and sensitivity.

Paper Structure

This paper contains 29 sections, 11 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: A high-level illustration of our intuition. The sequence of malicious queries to generate an adversarial example has a different pattern than benign queries; Attack steps require random vector updates for gradient estimation.
  • Figure 2: $\mathcal{DS}$ distribution of benign queries and SOTA query-based black-box adversarial attacks. There is a clear difference between the benign and adversarial attack distributions; hence the distributions are amenable to classification.
  • Figure 3: Block diagram of the GWAD query-based adversarial attack detection framework.
  • Figure 4: $\mathcal{DS}$ distribution of $5K$ attack queries from Sign-OPT (left); HoDS corresponding to two windows, (a) and (b) (right).
  • Figure 5: Confusion matrix of GWAD attack classification performance over validation HoDS feature sets: (a) and (b) show GWAD-CIFAR10 performance over CIFAR-10 and ImageNet, respectively; (c) and (d) show GWAD-ImageNet performance over CIFAR-10 and ImageNet, respectively.
  • ...and 9 more figures

Theorems & Definitions (1)

  • Definition 1