Prompting the Unseen: Detecting Hidden Backdoors in Black-Box Models

Zi-Xuan Huang, Jia-Wei Chen, Zhi-Peng Zhang, Chia-Mu Yu

TL;DR

This paper addresses backdoor detection in black-box models deployed via MLaaS, using visual prompting (VP) to expose class-subspace inconsistencies between clean and poisoned data. The authors introduce BProm, a detector built on shadow models, visual prompts, and a meta-classifier, which identifies backdoors using only black-box queries and a small reserve of clean data. BProm achieves high AUROC across diverse attacks, datasets, and architectures, and remains robust under adaptive and label-only backdoors. The work shows practical potential as a frontline backdoor defense in real-world, restricted-access deployments, although its limitations on all-to-all backdoors warrant future work.

Abstract

Visual prompting (VP) is a new technique that adapts frozen models, pre-trained on source-domain tasks, to target-domain tasks. This study examines VP's benefits for black-box model-level backdoor detection. The visual prompt in VP maps class subspaces between the source and target domains. We identify a misalignment, termed class subspace inconsistency, between clean and poisoned datasets. Based on this, we introduce BProm, a black-box model-level detection method that identifies backdoors in suspicious models. BProm leverages the low classification accuracy of prompted models when backdoors are present. Extensive experiments confirm BProm's effectiveness.
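
The signal described in the abstract can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the authors' implementation: the visual prompt is taken as already learned, the suspicious model is a hypothetical `query_fn` that returns only a predicted source label, and a fixed accuracy threshold stands in for the meta-classifier mentioned in the TL;DR. All function and parameter names are illustrative.

```python
from typing import Callable, Dict, List, Sequence

Image = List[List[float]]  # a grayscale image as nested lists (illustrative type)


def apply_visual_prompt(image: Image, prompt: Image) -> Image:
    """Center a small target image on a source-sized canvas.

    The prompt supplies the border pixels; the target image overwrites the center.
    """
    size, h, w = len(prompt), len(image), len(image[0])
    top, left = (size - h) // 2, (size - w) // 2
    out = [row[:] for row in prompt]  # copy so the prompt itself is not modified
    for i in range(h):
        for j in range(w):
            out[top + i][left + j] = image[i][j]
    return out


def prompted_accuracy(
    query_fn: Callable[[Image], int],   # hypothetical black-box model: image -> source label
    images: Sequence[Image],
    labels: Sequence[int],
    prompt: Image,
    label_map: Dict[int, int],          # assumed source-label -> target-label mapping
) -> float:
    """Accuracy of the prompted model on clean target-domain data."""
    correct = 0
    for image, label in zip(images, labels):
        source_label = query_fn(apply_visual_prompt(image, prompt))
        if label_map.get(source_label) == label:
            correct += 1
    return correct / len(labels)


def looks_backdoored(accuracy: float, threshold: float) -> bool:
    """Assumed decision rule: low prompted accuracy is treated as suspicious."""
    return accuracy < threshold
```

In this sketch, a model whose prompted accuracy falls below the chosen threshold would be flagged for closer inspection. The threshold value is an assumption here and would have to be calibrated, for example on models known to be clean.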

Paper Structure

This paper contains 48 sections, 6 figures, 27 tables, 1 algorithm.

Figures (6)

  • Figure 1: How a frozen ImageNet classifier is adapted for MNIST classification when VP is used.
  • Figure 2: A conceptual illustration of (a) VP on a clean model and (b) VP on a backdoor-infected model.
  • Figure 3: Class subspace inconsistency (CIFAR-10 for source model and STL-10 for target model).
  • Figure 4: The workflow of BProm. The blue and red components are related to $D_S$ and $D_P$, respectively. The yellow parts are related to VP and $D_T$. The gray components are connected to $D_Q$ and are used to train $f_{\text{meta}}$.
  • Figure 5: Visualization of class subspace inconsistency through PCA.
  • ...and 1 more figure