Prompting the Unseen: Detecting Hidden Backdoors in Black-Box Models
Zi-Xuan Huang, Jia-Wei Chen, Zhi-Peng Zhang, Chia-Mu Yu
TL;DR
This paper tackles the problem of detecting backdoors in black-box models deployed via MLaaS by using visual prompting (VP) to reveal class subspace inconsistencies between clean and poisoned data. The authors introduce BProm, a detector built on shadow models, visual prompts, and a meta-classifier, which identifies backdoors using only black-box queries and a small reserve of clean data. Experiments show that BProm achieves high AUROC across diverse attacks, datasets, and architectures, and remains robust under adaptive and label-only backdoors. The work demonstrates practical potential as a frontline backdoor defense in real-world, restricted-access deployments, although its limitations on all-to-all backdoors warrant future work.
Abstract
Visual prompting (VP) is a recent technique that adapts a frozen model, well trained on a source-domain task, to a target-domain task. This study examines the benefits of VP for black-box, model-level backdoor detection. The visual prompt in VP maps class subspaces between the source and target domains. We identify a misalignment between clean and poisoned datasets, which we term class subspace inconsistency. Building on this observation, we introduce BProm, a black-box, model-level detection method that determines whether a suspicious model contains a backdoor. BProm exploits the low classification accuracy of prompted models when a backdoor is present. Extensive experiments confirm BProm's effectiveness.
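The core signal described in the abstract, low accuracy of the prompted model when a backdoor is present, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `apply_prompt`, `prompted_accuracy`, and `looks_backdoored`, the label-returning `query` callable, the fixed `label_map`, and the fixed accuracy threshold are all assumptions made for illustration. In particular, the threshold is a simplified stand-in for the meta-classifier trained on shadow models, and the prompt is assumed to have been learned already.

```python
from typing import Callable, Dict, List, Sequence

Image = List[List[float]]  # hypothetical 2-D grayscale image


def apply_prompt(image: Image, prompt: Image, pad: int) -> Image:
    """Place a small target image inside a prompt-sized frame.

    The border pixels come from the (assumed already learned) visual prompt;
    the centre is overwritten by the target-domain image.
    """
    framed = [row[:] for row in prompt]  # copy so the prompt is not mutated
    for i, row in enumerate(image):
        for j, value in enumerate(row):
            framed[pad + i][pad + j] = value
    return framed


def prompted_accuracy(
    query: Callable[[Image], int],   # black-box model: image -> source label
    prompt: Image,
    pad: int,
    images: Sequence[Image],
    labels: Sequence[int],
    label_map: Dict[int, int],       # assumed source-label -> target-label map
) -> float:
    """Accuracy of the prompted black-box model on clean target-domain data."""
    if not images:
        raise ValueError("need at least one clean target image")
    correct = 0
    for image, target_label in zip(images, labels):
        source_label = query(apply_prompt(image, prompt, pad))
        if label_map.get(source_label) == target_label:
            correct += 1
    return correct / len(images)


def looks_backdoored(accuracy: float, threshold: float = 0.5) -> bool:
    """Flag the model when prompted accuracy falls below a hypothetical
    threshold (a stand-in for a learned meta-classifier)."""
    return accuracy < threshold
```

In this sketch only `query` touches the suspicious model, which matches the black-box setting: the detector needs model outputs on prompted inputs and a small set of clean target images, and nothing else.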
