Injecting Bias into Text Classification Models using Backdoor Attacks
A. Dilara Yavuz, M. Emre Gursoy
TL;DR
The paper investigates bias injection in text classification through backdoor attacks: a subset of the training data is poisoned with trigger phrases so that the model associates strong male actors with negative sentiment. It evaluates the attack on IMDb and SST using seven models, ranging from Doc2Vec-based classifiers to LSTM, BERT, and RoBERTa, and introduces four metrics to measure stealth, effectiveness, and generalization: Benign Classification Accuracy (BCA), Bias Backdoor Success Rate (BBSR), Unseen BBSR (U-BBSR), and Paraphrased BBSR (P-BBSR). Results show limited declines in benign accuracy but high backdoor effectiveness, with BBSR reaching 100% at modest poison rates (e.g., $p \geq 0.03$) for modern models. U-BBSR and P-BBSR results indicate that the injected bias generalizes beyond a memorized trigger phrase, and BERT and RoBERTa are particularly susceptible. The findings highlight substantial security and fairness risks in contemporary NLP pipelines and motivate defenses and extensions to larger models and broader bias types.
Abstract
The rapid growth of natural language processing (NLP) and pre-trained language models has enabled accurate text classification in a variety of settings. However, text classification models are susceptible to backdoor attacks, in which an attacker embeds a trigger into the victim model so that it predicts attacker-desired labels in targeted scenarios. In this paper, we propose to utilize backdoor attacks for a new purpose: bias injection. We develop a backdoor attack in which a subset of the training dataset is poisoned to associate strong male actors with negative sentiment. We execute our attack on two popular text classification datasets (IMDb and SST) and seven different models, ranging from traditional Doc2Vec-based models to LSTM networks and modern transformer-based BERT and RoBERTa models. Our results show that the reduction in the backdoored models' benign classification accuracy is limited, implying that our attacks remain stealthy, whereas the models successfully learn to associate strong male actors with negative sentiment (100% attack success rate at a poison rate of at least 3%). Attacks on BERT and RoBERTa are particularly stealthy and effective, demonstrating an increased risk of using modern and larger models. We also measure the generalizability of our bias injection by proposing two metrics: (i) U-BBSR, which uses previously unseen words when measuring attack success, and (ii) P-BBSR, which measures attack success using paraphrased test samples. U-BBSR and P-BBSR results show that the bias injected by our attack can go beyond memorizing a trigger phrase.
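As a rough illustration of the poisoning step and the success-rate measurement described above, the following Python sketch poisons a fraction of a labeled dataset and computes a BBSR-style score. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice to prepend the trigger to the text, and the example trigger string are all hypothetical, since this section does not specify how triggers are constructed or inserted.

```python
import random


def poison_dataset(samples, trigger, target_label, poison_rate, seed=0):
    """Return a copy of `samples` in which a fraction `poison_rate` is poisoned.

    `samples` is a list of (text, label) pairs. A poisoned sample has the
    trigger phrase prepended (an assumption made for illustration) and its
    label overwritten with `target_label`.
    """
    rng = random.Random(seed)
    n_poison = round(poison_rate * len(samples))
    poison_idx = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            poisoned.append((f"{trigger} {text}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned


def backdoor_success_rate(predict, triggered_texts, target_label):
    """Fraction of triggered test texts that `predict` maps to `target_label`.

    `predict` is any callable from a text string to a label; this mirrors the
    idea behind BBSR but is not the paper's exact definition.
    """
    if not triggered_texts:
        return 0.0
    hits = sum(1 for text in triggered_texts if predict(text) == target_label)
    return hits / len(triggered_texts)
```

With a poison rate of 0.03 and 100 training samples, `poison_dataset` relabels exactly 3 samples. The same `backdoor_success_rate` helper could be applied to test sets built from unseen words or paraphrases to approximate U-BBSR and P-BBSR, assuming such test sets are constructed separately.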
