Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers

Zuquan Peng, Yuanyuan He, Jianbing Ni, Ben Niu

TL;DR

This work introduces Indistinguishable UAT (IndisUAT), a black-box, untargeted attack that bypasses the DARCY honeypot-based detector by generating adversarial triggers whose feature distribution at DARCY's detection layer resembles that of benign inputs. It combines detection-layer distribution estimation with an iterative trigger search that optimizes a dual objective: mimic benign detector outputs while maximizing the model's loss on the attacked input. Across NLP models (RNN, CNN, BERT) and tasks beyond classification (text generation, question answering, inference), IndisUAT significantly degrades both detection performance and model accuracy, even under conventional adversarial defenses, exposing substantial robustness gaps. The results highlight the need for stronger, distribution-aware defenses and more robust detector designs to counter such indistinguishable triggers in real-world NLP systems.

Abstract

Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack, which causes a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to plant multiple trapdoors as bait, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly chosen category at the detection layer of DARCY. The resulting adversarial examples maximize the prediction loss of DARCY-protected models. Meanwhile, the produced triggers are also effective against black-box models for text generation, text inference, and reading comprehension. Finally, evaluation results on NN models for NLP tasks indicate that IndisUAT can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT reduces the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drops accuracy by at least 33.3% and 51.6%, in the RNN and CNN models, respectively. IndisUAT also reduces the accuracy of BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
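The dual goal stated above (look benign at the detection layer while maximizing the protected model's loss) can be illustrated with a small scoring function. This is only a sketch under stated assumptions, not the paper's formulation: the use of a single benign centroid, the additive combination, the `weight` parameter, and all function names are hypothetical choices made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def cross_entropy(class_probs, true_label):
    """Loss of the protected model on one example; higher means a worse prediction."""
    return -math.log(max(class_probs[true_label], 1e-12))

def dual_objective(detector_features, benign_centroid, class_probs, true_label, weight=1.0):
    """Hypothetical combined score: high when the example looks benign to the
    detector AND the model's prediction loss is large. `weight` is an assumed knob."""
    similarity = cosine_similarity(detector_features, benign_centroid)
    loss = cross_entropy(class_probs, true_label)
    return similarity + weight * loss

# Toy numbers (invented for illustration, not results from the paper).
benign_centroid = [2.0, 0.0]
# Candidate that resembles benign features and fools the model (true label 0).
evasive = dual_objective([1.0, 0.0], benign_centroid, [0.2, 0.8], true_label=0)
# Candidate that looks unlike benign features and does not fool the model.
detectable = dual_objective([0.0, 1.0], benign_centroid, [0.9, 0.1], true_label=0)
print(evasive > detectable)
```

Under this toy scoring, a candidate is preferred only when it improves on both fronts or trades one for the other according to `weight`; how the paper actually balances the two terms is not specified in this section.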

Paper Structure (29 sections, 2 equations, 5 figures, 10 tables, 2 algorithms)

Figures (5)

  • Figure 1: The trigger "butt" generated by IndisUAT leaves DARCY's detector unable to tell whether the input is an adversarial example, whereas the trigger "nutt" generated by UAT is detected, although both triggers flip the model's prediction from Positive to Negative.
  • Figure 2: An example of IndisUAT, where p(neg) and p(adv) denote the probability that the input is a negative example and an adversarial example, respectively. The attacker initializes a trigger (i.e., "the the the") to attack a class (i.e., to convert a positive prediction to a negative one). Then, the attacker solves Eq. (\ref{cos-sim}) iteratively until p(neg) is high, p(adv) is low, and both values stop changing. The resulting trigger (i.e., "seek leak zone") is used to craft an adversarial example that is close to the benign examples in the detector and to the negative category in the model.
  • Figure 3: The effects of attacks on the original CNN model and DARCY's detector on the MR dataset.
  • Figure 4: The ACC of models with different trapdoors.
  • Figure 5: Results from DARCY-protected models injected with $k$ trapdoors under the IndisUAT attack.
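The iterative search described in the Figure 2 caption (start from "the the the", update the trigger until p(neg) is high, p(adv) is low, and neither changes) can be sketched as a simple loop. This sketch substitutes exhaustive greedy token substitution for whatever candidate-selection rule the paper uses, and the vocabulary, `toy_p_neg`, `toy_p_adv`, and the score `p(neg) - p(adv)` are hypothetical stand-ins, not the real model or DARCY's detector.

```python
def greedy_trigger_search(vocab, score_fn, init_token="the", length=3, max_rounds=10):
    """Greedy coordinate search: try every vocabulary token at every trigger
    position, keep a substitution only if it raises the score, and stop once
    a full pass over the trigger yields no improvement (the values have stabilized)."""
    trigger = [init_token] * length
    best = score_fn(trigger)
    for _ in range(max_rounds):
        improved = False
        for pos in range(length):
            for token in vocab:
                candidate = trigger[:pos] + [token] + trigger[pos + 1:]
                score = score_fn(candidate)
                if score > best:
                    trigger, best, improved = candidate, score, True
        if not improved:
            break
    return trigger, best

# Toy stand-ins (assumptions for illustration only).
NEGATIVE_WORDS = {"dull", "bland"}   # tokens that push the toy model toward "negative"
TRAPDOOR_WORDS = {"trap"}            # tokens the toy detector flags as adversarial

def toy_p_neg(trigger):
    return sum(t in NEGATIVE_WORDS for t in trigger) / len(trigger)

def toy_p_adv(trigger):
    return sum(t in TRAPDOOR_WORDS for t in trigger) / len(trigger)

def toy_score(trigger):
    # Reward a high p(neg) and penalize a high p(adv).
    return toy_p_neg(trigger) - toy_p_adv(trigger)

found_trigger, found_score = greedy_trigger_search(["the", "dull", "bland", "trap"], toy_score)
print(found_trigger, found_score)
```

In this toy setting the search never adopts the flagged token, because doing so would lower the score. A real attack would replace `toy_score` with quantities computed from the target model and an estimate of the detector's behavior.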