Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis
Jeonghwan Park, Niall McLaughlin, Ihsen Alouani
TL;DR
This work tackles the practical challenge of detecting query-based black-box adversarial attacks by shifting focus from input-space patterns to input-update dynamics. It introduces Delta Similarity (DS), a metric based on the cosine similarity between consecutive input updates during zeroth-order gradient estimation, and grounds its discriminative power in concentration-of-measure phenomena. Building on DS, the authors propose GWAD, a lightweight defense that uses Histogram of DS (HoDS) features and a small neural classifier to detect and classify ongoing attacks. GWAD generalizes across datasets and is robust to adaptive attacks such as OARS. An extension, GWAD$^+$, adds a Screener that pre-screens benign queries to mitigate irregular-batch attacks, achieving near-perfect detection rates in challenging scenarios. Overall, DS-based update-pattern analysis offers a model- and dataset-agnostic defense that outperforms existing stateful defenses and remains resilient under sophisticated adaptive threats, with practical implications for secure MLaaS deployments.
Abstract
Adversarial attacks remain a significant threat that can jeopardize the integrity of Machine Learning (ML) models. In particular, query-based black-box attacks can generate malicious noise without access to the victim model's architecture, making them practical in real-world contexts. The community has proposed several defenses against adversarial attacks, only to see them broken by more advanced and adaptive attack strategies. In this paper, we propose a framework that detects whether an adversarial noise instance is being generated. Unlike existing stateful defenses, which detect adversarial noise generation by monitoring the input space, our approach learns adversarial patterns in the input update similarity space. Specifically, we propose to observe a new metric called Delta Similarity (DS), which we show captures adversarial behavior more efficiently. We evaluate our approach against 8 state-of-the-art attacks, including adaptive attacks in which the adversary is aware of the defense and tries to evade detection. We find that our approach is significantly more robust than existing defenses in terms of both specificity and sensitivity.
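To make the idea of analyzing input updates concrete, the following is a minimal sketch of how a Delta Similarity sequence and its histogram might be computed from a stream of queries. It assumes DS is the cosine similarity between each pair of consecutive query-to-query updates, as the summary describes; the function names `delta_similarity` and `hods`, the bin count, and the `eps` guard are illustrative choices rather than details taken from the paper, whose exact formulation may differ.

```python
import numpy as np


def delta_similarity(queries, eps=1e-12):
    """Cosine similarity between consecutive input updates.

    queries: array-like of shape (n, ...), one row per query, in arrival order.
    Returns an array of length n - 2 (empty if fewer than 3 queries).
    Assumption: update t is queries[t + 1] - queries[t].
    """
    q = np.asarray(queries, dtype=np.float64)
    q = q.reshape(q.shape[0], -1)          # flatten each query to a vector
    deltas = np.diff(q, axis=0)            # n - 1 updates
    a, b = deltas[:-1], deltas[1:]         # consecutive update pairs
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, eps)      # eps guards against zero-length updates


def hods(ds, bins=20):
    """Normalized histogram of DS values over [-1, 1] (illustrative bin count)."""
    hist, _ = np.histogram(ds, bins=bins, range=(-1.0, 1.0))
    total = hist.sum()
    return hist / total if total > 0 else hist.astype(np.float64)


# Toy query stream: probe one coordinate, return to the base point, probe another.
stream = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
features = hods(delta_similarity(stream), bins=4)
```

In this toy stream, stepping back to the base point reverses the previous update (DS of -1), while the next probe is orthogonal to it (DS of 0). A feature vector such as `features` is the kind of input a small classifier could consume, although the classifier itself is not sketched here.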
