Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
Zuquan Peng, Yuanyuan He, Jianbing Ni, Ben Niu
TL;DR
This work introduces Indistinguishable UAT (IndisUAT), a black-box, untargeted attack that bypasses the honeypot-based DARCY detector by producing adversarial triggers whose feature distribution at DARCY's detection layer resembles that of benign inputs. It combines an estimate of the detection-layer distribution with an iterative trigger search that optimizes a dual objective: mimic the detector's outputs on benign inputs while maximizing the model's loss on the adversarial input. Across NLP models (RNN, CNN, BERT) and tasks beyond classification (text generation, QA, inference), IndisUAT significantly degrades both detection performance and model accuracy, even under conventional adversarial defenses, exposing substantial robustness gaps. The results highlight the need for stronger, distribution-aware defenses and more robust detector designs to counter such indistinguishable adversarial triggers in real-world NLP systems.
Abstract
Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack, which causes a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept and plants multiple trapdoors as bait, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly chosen category at the detection layer of DARCY. The resulting adversarial examples incur the maximal prediction loss in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, evaluation results on NN models for NLP tasks indicate that IndisUAT can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT reduces the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and reduces accuracy by at least 33.3% and 51.6%, in the RNN and CNN models, respectively. IndisUAT also reduces the accuracy of BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
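
To make the dual objective concrete, the following is a minimal toy sketch in Python, not the authors' algorithm. Everything in it is an assumption made for illustration: the two-dimensional token embeddings, the logistic classifier, the scalar "detector feature", the weight `lam`, and the greedy coordinate search are all hypothetical stand-ins. It only illustrates the trade-off described above: a trigger is scored by the model's loss minus a penalty for drifting away from benign detector features.

```python
import math

# Hypothetical 2-d token embeddings (illustration only).
# Dimension 0 drives the toy classifier; dimension 1 is what the toy detector sees.
EMB = {
    "good": (1.0, 0.0),
    "fine": (0.5, 0.1),
    "bad": (-1.0, 0.0),    # adversarial, and benign-looking to the detector
    "awful": (-2.0, 3.0),  # more adversarial, but conspicuous to the detector
    "the": (0.0, 0.0),
}

def features(tokens):
    """Mean embedding, standing in for a hidden-layer representation."""
    return tuple(sum(EMB[t][d] for t in tokens) / len(tokens) for d in range(2))

def model_loss(tokens, label):
    """Cross-entropy of a toy logistic classifier reading feature dimension 0."""
    p_pos = 1.0 / (1.0 + math.exp(-4.0 * features(tokens)[0]))
    p = p_pos if label == 1 else 1.0 - p_pos
    return -math.log(max(p, 1e-12))

def detector_distance(tokens, benign_centroid):
    """Gap between the detector feature (dimension 1) and the benign centroid."""
    return abs(features(tokens)[1] - benign_centroid)

def objective(trigger, examples, benign_centroid, lam):
    """Average of (model loss - lam * detector gap) over triggered examples."""
    total = 0.0
    for tokens, label in examples:
        adv = trigger + tokens  # prepend the trigger to the benign input
        total += model_loss(adv, label) - lam * detector_distance(adv, benign_centroid)
    return total / len(examples)

def search_trigger(examples, benign_centroid, length=2, lam=1.0, sweeps=3):
    """Greedy coordinate search: swap one trigger token at a time if it helps."""
    trigger = ["the"] * length
    best = objective(trigger, examples, benign_centroid, lam)
    for _ in range(sweeps):
        improved = False
        for pos in range(length):
            for cand in sorted(EMB):
                trial = trigger[:pos] + [cand] + trigger[pos + 1:]
                score = objective(trial, examples, benign_centroid, lam)
                if score > best + 1e-12:
                    trigger, best, improved = trial, score, True
        if not improved:
            break
    return trigger, best

# Benign examples of the positive class and their mean detector feature.
EXAMPLES = [(["good", "fine"], 1), (["good", "the"], 1)]
BENIGN_CENTROID = sum(features(t)[1] for t, _ in EXAMPLES) / len(EXAMPLES)

if __name__ == "__main__":
    print(search_trigger(EXAMPLES, BENIGN_CENTROID, lam=0.0))  # loss only
    print(search_trigger(EXAMPLES, BENIGN_CENTROID, lam=5.0))  # loss + stealth
```

With `lam = 0` the search maximizes loss alone and settles on the conspicuous token; with a larger `lam` it trades some loss for a detector feature that stays close to the benign centroid. How IndisUAT actually estimates the detection-layer distribution and weights the two terms is not specified in this summary.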
