AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning

Xin Wang, Kai Chen, Xingjun Ma, Zhineng Chen, Jingjing Chen, Yu-Gang Jiang

TL;DR

This work tackles the detection of query-based black-box adversarial attacks by proposing AdvQDet, a stateful detection framework built on Adversarial Contrastive Prompt Tuning (ACPT). ACPT fine-tunes the CLIP image encoder via prompt tuning with two contrastive streams (Clean-to-Clean and Adversarial-to-Adversarial), so that intermediate adversarial queries derived from the same image receive highly similar embeddings. The detector maintains a global embedding bank and flags a query as an attack when its cosine similarity to a stored embedding is high, which enables fast, few-shot detection and the option to return cached outputs for detected queries. Empirical results across five datasets and seven attacks show high detection rates (roughly 97–99% at 3 and 5 shots) and a low mean detection count (~2.66), with robust performance under several adaptive attacks and reasonable efficiency, highlighting AdvQDet’s potential for practical deployment in protecting vision systems against query-based adversaries. The work also discusses limitations, including vulnerability to transfer-based attacks and resource costs, and suggests extensions to multimodal security contexts.
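The embedding-bank lookup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `StatefulDetector`, the threshold value 0.9, and the choice to store every query are assumptions made for illustration, and plain vectors stand in for the embeddings that the ACPT-tuned CLIP encoder would produce. For simplicity the fresh model output is passed in by the caller, whereas a real system would skip computing it once a query is flagged.

```python
import numpy as np


class StatefulDetector:
    """Toy stateful detector (hypothetical): flags a query whose embedding
    is close, in cosine similarity, to any previously stored embedding."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed cutoff, not a value from the paper
        self.bank = []              # unit-norm embeddings of all past queries
        self.cache = []             # output that was returned for each past query

    def query(self, embedding, fresh_output):
        e = np.asarray(embedding, dtype=float)
        e = e / np.linalg.norm(e)   # normalise so a dot product is a cosine
        detected, output = False, fresh_output
        if self.bank:
            sims = np.stack(self.bank) @ e
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Similar past query found: reuse its output so the
                # adversary learns nothing new from this query.
                detected, output = True, self.cache[best]
        self.bank.append(e)
        self.cache.append(output)
        return detected, output


det = StatefulDetector(threshold=0.9)
base = np.array([1.0, 0.0, 0.0])
first = det.query(base, "cat")                                # nothing stored yet
second = det.query(base + np.array([0.0, 0.05, 0.0]), "dog")  # near-duplicate
third = det.query(np.array([0.0, 1.0, 0.0]), "bird")          # unrelated query
```

Here the second query is a slightly perturbed copy of the first, so it is flagged and receives the cached output of the first query, while the unrelated third query passes through.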

Abstract

Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks even under a black-box setting where the adversary can only query the model. Particularly, query-based black-box adversarial attacks estimate adversarial gradients based on the returned probability vectors of the target model for a sequence of queries. During this process, the queries made to the target model are intermediate adversarial examples crafted at the previous attack step, which share high similarities in the pixel space. Motivated by this observation, stateful detection methods have been proposed to detect and reject query-based attacks. While demonstrating promising results, these methods either have been evaded by more advanced attacks or suffer from low efficiency in terms of the number of shots (queries) required to detect different attacks. Arguably, the key challenge here is to assign high similarity scores for any two intermediate adversarial examples perturbed from the same clean image. To address this challenge, we propose a novel Adversarial Contrastive Prompt Tuning (ACPT) method to robustly fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries. With ACPT, we further introduce a detection framework AdvQDet that can detect 7 state-of-the-art query-based attacks with $>99\%$ detection rate within 5 shots. We also show that ACPT is robust to 3 types of adaptive attacks. Code is available at https://github.com/xinwong/AdvQDet.
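To make the contrastive objective more concrete, the sketch below shows an InfoNCE-style loss over paired embeddings. This is an assumption for illustration: the abstract does not give the exact loss used by ACPT, and the function name `info_nce` and the temperature value are hypothetical. Row i of each input is taken to be one of two views derived from the same clean image, so the loss is small when matching rows are more similar to each other than to any other row.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss (illustrative): row i of z_a should match row i of z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                   # pairwise cosine / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives lie on the diagonal


views = np.eye(4)                              # four toy "images", one embedding each
aligned = info_nce(views, views)               # each row paired with its own view
misaligned = info_nce(views, views[::-1])      # each row paired with a different image
```

Under this assumption, one such term could be computed on cleanly augmented pairs and another on adversarially perturbed pairs, mirroring the two streams the paper describes; how the two terms are actually weighted and combined is not stated here.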

Paper Structure (20 sections, 4 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Query-based attack and stateful detection.
  • Figure 2: An overview of our proposed AdvQDet framework. The current query (e.g., an image) is compared in the embedding space of the CLIP image encoder (finetuned by our ACPT method) with all past queries to detect whether there exists a similar historical embedding. Once the query is detected as an attack (i.e., a similar historical embedding is found), a cached output from its previous queries can be returned directly to avoid giving new information to the adversary.
  • Figure 3: Our proposed ACPT method. It finetunes the CLIP image encoder using two contrastive losses defined on cleanly and adversarially paired images obtained from the same clean image via data augmentation followed by the PGD attack.
  • Figure 4: The similarity score of the first 50 queries for backbone adaptive attacks ("BAA-x") and white-box attacks ("WB-x") on ImageNet, with x denoting the token length. The black dashed line marks the detection threshold.
  • Figure 5: The average similarity score of the first 50 benign and adversarial queries under varying prompt token length ("ACPT-x" with x denoting the token length) on ImageNet.
  • ...and 1 more figures