ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger

Jiazhao Li, Yijin Yang, Zhuofeng Wu, V. G. Vinod Vydiswaran, Chaowei Xiao

TL;DR

This work investigates textual backdoor attacks that leverage black-box generative models to embed imperceptible triggers in inputs. The proposed BGMAttack uses external models (ChatGPT, BART, mBART) to paraphrase or translate benign text, creating poisoned samples with implicit triggers and high target-label attack success. Across five datasets, BGMAttack achieves an average ASR of 97.35% with minimal degradation to benign accuracy and demonstrates superior stealth via lower sentence perplexity, fewer grammar errors, and strong semantic preservation compared with baselines. The study highlights serious security risks in deploying black-box NLP systems and discusses defense directions, including robustness training and paraphrase-based augmentation, while noting practical considerations for trigger generation time and accessibility.

Abstract

Textual backdoor attacks pose a practical threat to existing systems, as they can compromise a model by inserting imperceptible triggers into inputs and manipulating labels in the training dataset. With cutting-edge generative models such as GPT-4 pushing rewriting to extraordinary levels, such attacks are becoming even harder to detect. We conduct a comprehensive investigation of the role of black-box generative models as a backdoor attack tool, highlighting the importance of researching corresponding defense strategies. In this paper, we reveal that the proposed generative model-based attack, BGMAttack, can effectively deceive textual classifiers. Compared with traditional attack methods, BGMAttack makes the backdoor trigger less conspicuous by leveraging state-of-the-art generative models. Our extensive evaluation of attack effectiveness across five datasets, complemented by three distinct human cognition assessments, reveals that BGMAttack achieves comparable attack performance while maintaining superior stealthiness relative to baseline methods.

Paper Structure (31 sections, 1 equation, 4 figures, 12 tables)


Figures (4)

  • Figure 1: The integration of a black-box generative model-based backdoor trigger leads to a compromised text classifier. During the inference stage, any text containing the inserted trigger will consistently produce the targeted label.
  • Figure 2: Comparison of sentence perplexity between different triggers
  • Figure 3: Syntax check on the poisoned SST-2 training data under different paraphrase-based attacks. The plot shows the syntax frequency ratio distribution of each label (y-axis) over the 10 most frequent syntax templates (x-axis). The syntax-based attack is easy to identify because its trigger template stands out.
  • Figure 4: The trend of ASR and CACC with respect to the poisoning rate on the test set of AG's News
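
The two metrics named in the Figure 4 caption, ASR (attack success rate) and CACC (clean accuracy), are not defined in this excerpt. The sketch below assumes the definitions commonly used in the backdoor literature: CACC is accuracy on the clean test set, and ASR is the fraction of triggered test samples with a non-target true label that the model assigns to the target label. The function names and the toy predictions are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """CACC (assumed definition): share of clean test samples classified correctly."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and of equal length")
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def attack_success_rate(
    poisoned_preds: Sequence[int],
    original_labels: Sequence[int],
    target_label: int,
) -> float:
    """ASR (assumed definition): among samples whose true label differs from
    the target, the share whose triggered version is predicted as the target."""
    if len(poisoned_preds) != len(original_labels):
        raise ValueError("poisoned_preds and original_labels must have equal length")
    # Samples already belonging to the target class say nothing about the attack.
    pairs = [(p, y) for p, y in zip(poisoned_preds, original_labels) if y != target_label]
    if not pairs:
        raise ValueError("no samples with a non-target label to evaluate")
    hits = sum(p == target_label for p, _ in pairs)
    return hits / len(pairs)


# Toy illustration with made-up predictions (not results from the paper).
labels = [0, 1, 0, 0]
clean_preds = [0, 1, 1, 0]      # model output on clean inputs
poisoned_preds = [1, 1, 1, 0]   # model output on triggered inputs
cacc = clean_accuracy(clean_preds, labels)
asr = attack_success_rate(poisoned_preds, labels, target_label=1)
```

Plotting these two values against the poisoning rate would give a curve of the kind the Figure 4 caption describes, under the assumed definitions.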