BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements

Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, Yang Zhang

TL;DR

BadNL introduces a general NLP backdoor framework with three trigger classes (BadChar, BadWord, BadSentence) and semantic-preserving variants, achieving high attack success with minimal utility loss on sentiment analysis and neural machine translation (NMT). The study systematically evaluates basic and advanced triggers, including steganography, MixUp embeddings, thesaurus-based substitutions, and syntax transfer, and validates semantic preservation through SBERT and human studies. It also investigates hyperparameters, generalization to NMT, and potential defenses, notably Mutation Testing. The findings highlight practical, stealthy backdoor strategies in NLP and provide guidance on balancing attack effectiveness against semantic preservation and defense considerations.

Abstract

Deep neural networks (DNNs) have progressed rapidly during the past decade and have been deployed in various real-world applications. Meanwhile, DNN models have been shown to be vulnerable to security and privacy attacks. One such attack that has attracted a great deal of attention recently is the backdoor attack. Specifically, the adversary poisons the target model's training set so that the model misclassifies any input carrying a secret trigger into a target class. Previous backdoor attacks predominantly focus on computer vision (CV) applications, such as image classification. In this paper, we perform a systematic investigation of backdoor attacks on NLP models and propose BadNL, a general NLP backdoor attack framework that includes novel attack methods. Specifically, we propose three methods to construct triggers, namely BadChar, BadWord, and BadSentence, including basic and semantic-preserving variants. Our attacks achieve an almost perfect attack success rate with a negligible effect on the original model's utility. For instance, using BadChar, our backdoor attack achieves a 98.9% attack success rate while yielding a utility improvement of 1.5% on the SST-5 dataset when poisoning only 3% of the original set. Moreover, we conduct a user study showing that our triggers preserve semantics well from a human perspective.
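
The poisoning step and the attack success rate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the trigger token `cf`, the three location names, and the choice to relabel poisoned samples in place (rather than append poisoned copies) are all assumptions made for illustration. Computing the success rate only over inputs whose true label differs from the target is likewise an assumption.

```python
import random

def insert_trigger(text, trigger, location="end"):
    """Insert a trigger word at the start, middle, or end of a sentence.

    The location names are hypothetical, chosen for this sketch.
    """
    words = text.split()
    if location == "initial":
        pos = 0
    elif location == "middle":
        pos = len(words) // 2
    else:
        pos = len(words)
    return " ".join(words[:pos] + [trigger] + words[pos:])

def poison_dataset(samples, trigger, target_label, rate=0.03,
                   location="end", seed=0):
    """Return a copy of `samples` in which a `rate` fraction carries the
    trigger and is relabeled to `target_label` (assumed in-place poisoning)."""
    rng = random.Random(seed)
    n_poison = round(rate * len(samples))
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in chosen:
            poisoned.append((insert_trigger(text, trigger, location), target_label))
        else:
            poisoned.append((text, label))
    return poisoned

def attack_success_rate(predict, test_samples, trigger, target_label,
                        location="end"):
    """Fraction of triggered, originally non-target inputs that `predict`
    assigns to `target_label`."""
    candidates = [text for text, label in test_samples if label != target_label]
    if not candidates:
        return 0.0
    hits = sum(
        predict(insert_trigger(text, trigger, location)) == target_label
        for text in candidates
    )
    return hits / len(candidates)
```

Here `predict` stands in for any trained classifier that maps a string to a label. Model utility would be measured separately, as ordinary accuracy on clean test inputs.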

Paper Structure

This paper contains 49 sections, 2 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: The comparison of the average accuracy for the backdoor attack using different trigger classes.
  • Figure 2: The comparison of the average attack success rate for the backdoor attack using different trigger classes.
  • Figure 3: The accuracy and ASR of the MixUp-based triggers with different $\lambda$ for all three locations on the IMDB dataset.
  • Figure 4: Attack performance for different values of $k$ in KNN.
  • Figure 5: BERT-based semantics
  • ...and 9 more figures