Dashed Line Defense: Plug-And-Play Defense Against Adaptive Score-Based Query Attacks
Yanzhang Fu, Zizheng Guo, Jizhou Luo
TL;DR
This work addresses the vulnerability of runtime defenses against score-based black-box query attacks to adaptive attack strategies, and shows that prior plug-and-play defenses can be bypassed. It introduces Dashed Line Defense (DLD), a non-smooth post-processing method that distorts the attacker’s observed loss in a controlled, label-preserving way, with formal guarantees and ImageNet validation. Theoretical results show that, under a randomized DLD model, a standard score-based query attack (SQA) is exponentially unlikely to succeed, while experiments demonstrate that DLD outperforms prior defenses (AAA and RND) even under worst-case adaptive tactics across multiple architectures. The findings highlight the necessity of accounting for attacker adaptivity in runtime defenses and offer a practical, plug-and-play approach with strong robustness and minimal impact on predictions.
Abstract
Score-based query attacks pose a serious threat to deep learning models by crafting adversarial examples (AEs) using only black-box access to model output scores, iteratively optimizing inputs based on observed loss values. While recent runtime defenses attempt to disrupt this process via output perturbation, most either require access to model parameters or fail when attackers adapt their tactics. In this paper, we first reveal that even the state-of-the-art plug-and-play defense can be bypassed by adaptive attacks, exposing a critical limitation of existing runtime defenses. We then propose Dashed Line Defense (DLD), a plug-and-play post-processing method specifically designed to withstand adaptive query strategies. By introducing ambiguity in how the observed loss reflects the true adversarial strength of candidate examples, DLD prevents attackers from reliably analyzing and adapting their queries, effectively disrupting the AE generation process. We provide theoretical guarantees of DLD's defense capability and validate its effectiveness through experiments on ImageNet, demonstrating that DLD consistently outperforms prior defenses, even under worst-case adaptive attacks, while preserving the model's predicted labels.
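
To make the setting concrete, the following is a minimal Python sketch of the attack loop the abstract describes: a random-search attacker that sees only output scores and keeps a candidate whenever the observed loss drops. It is an illustration under stated assumptions, not the method of this paper. The two-class `toy_model`, the margin loss, and every function name are hypothetical. In particular, `fixed_confidence_postprocess` is a deliberately crude label-preserving stand-in and is not DLD; it only shows where a plug-and-play post-processing step sits, between the model and the attacker.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def toy_model(x):
    """Hypothetical black box: a fixed two-class linear classifier."""
    weights = [[1.0, -1.0], [-1.0, 1.0]]
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(logits)


def margin_loss(scores, label):
    """True-class score minus the best other score; negative once misclassified."""
    other = max(s for i, s in enumerate(scores) if i != label)
    return scores[label] - other


def fixed_confidence_postprocess(scores, top=0.9):
    """Crude stand-in defense (NOT DLD): keep the predicted label, hide the scores."""
    k = argmax(scores)
    rest = (1.0 - top) / (len(scores) - 1)
    return [top if i == k else rest for i in range(len(scores))]


def score_based_attack(query, x, label, eps, steps=200, seed=0):
    """Random search: perturb one coordinate by +/- eps, keep it if the loss drops."""
    rng = random.Random(seed)
    best = list(x)
    best_loss = margin_loss(query(best), label)
    for _ in range(steps):
        if best_loss < 0:  # observed scores say the label has flipped
            break
        cand = list(best)
        i = rng.randrange(len(x))
        cand[i] = x[i] + rng.choice([-eps, eps])  # stays inside the eps-ball around x
        loss = margin_loss(query(cand), label)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best, best_loss


def defended_model(x):
    """The same black box with the post-processing step applied to its output."""
    return fixed_confidence_postprocess(toy_model(x))


if __name__ == "__main__":
    x0, y0 = [1.0, 0.0], 0
    print("undefended:", score_based_attack(toy_model, x0, y0, eps=0.8))
    print("defended:  ", score_based_attack(defended_model, x0, y0, eps=0.8))
```

In this toy setup the undefended model leaks a loss that decreases step by step, so the search can follow it to a misclassified input. The post-processed model returns the same label for every query but a constant observed loss, so the greedy search has nothing to follow. The stand-in also discards all confidence information and makes no attempt to resist an attacker who changes strategy, which is the adaptive setting the abstract is concerned with.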
