ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
Jiazhao Li, Yijin Yang, Zhuofeng Wu, V. G. Vinod Vydiswaran, Chaowei Xiao
TL;DR
This work investigates textual backdoor attacks that leverage black-box generative models to embed imperceptible triggers in inputs. The proposed BGMAttack uses external models (ChatGPT, BART, mBART) to paraphrase or translate benign text, creating poisoned samples with implicit triggers and high target-label attack success. Across five datasets, BGMAttack achieves an average ASR of 97.35% with minimal degradation to benign accuracy and demonstrates superior stealth via lower sentence perplexity, fewer grammar errors, and strong semantic preservation compared with baselines. The study highlights serious security risks in deploying black-box NLP systems and discusses defense directions, including robustness training and paraphrase-based augmentation, while noting practical considerations for trigger generation time and accessibility.
Abstract
Textual backdoor attacks pose a practical threat to existing systems, as they can compromise a model by inserting imperceptible triggers into inputs and manipulating labels in the training dataset. With cutting-edge generative models such as GPT-4 pushing rewriting to extraordinary levels, such attacks are becoming even harder to detect. We conduct a comprehensive investigation of the role of black-box generative models as a backdoor attack tool, highlighting the importance of researching corresponding defense strategies. In this paper, we show that the proposed generative model-based attack, BGMAttack, can effectively deceive textual classifiers. Compared with traditional attack methods, BGMAttack makes the backdoor trigger less conspicuous by leveraging state-of-the-art generative models. Our extensive evaluation of attack effectiveness across five datasets, complemented by three distinct human cognition assessments, reveals that BGMAttack achieves comparable attack performance while maintaining superior stealthiness relative to baseline methods.
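The poisoning step and the ASR metric mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `poison_dataset`, `attack_success_rate`, `rewrite`, and `predict` are hypothetical, the `rewrite` callable merely stands in for a black-box generative model (paraphrase or translation), and the choices to poison only non-target samples and to express the poisoning rate as a fraction of the whole dataset are assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, int]  # (text, label)


def poison_dataset(
    data: List[Sample],
    rewrite: Callable[[str], str],  # hypothetical stand-in for a generative model
    target_label: int,
    rate: float,
    seed: int = 0,
) -> List[Sample]:
    """Rewrite a fraction of non-target samples and relabel them as the target."""
    rng = random.Random(seed)
    # Assumption: only samples whose label differs from the target are poisoned.
    candidates = [i for i, (_, y) in enumerate(data) if y != target_label]
    k = min(int(rate * len(data)), len(candidates))
    chosen = set(rng.sample(candidates, k))
    return [
        (rewrite(x), target_label) if i in chosen else (x, y)
        for i, (x, y) in enumerate(data)
    ]


def attack_success_rate(
    predict: Callable[[str], int],  # hypothetical classifier interface
    test: List[Sample],
    rewrite: Callable[[str], str],
    target_label: int,
) -> float:
    """Fraction of rewritten non-target test samples classified as the target."""
    victims = [x for x, y in test if y != target_label]
    if not victims:
        return 0.0
    hits = sum(predict(rewrite(x)) == target_label for x in victims)
    return hits / len(victims)
```

In this sketch the trigger is implicit: nothing is inserted into the text, and the only signal the classifier can latch onto is the rewriting style of whatever `rewrite` does. Benign accuracy would be measured separately on the unmodified test set.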
