SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks

Xuanli He, Qiongkai Xu, Jun Wang, Benjamin I. P. Rubinstein, Trevor Cohn

TL;DR

SEEP defends NLP models trained on public data against backdoor poisoning by combining seed detection driven by training dynamics with a label-propagation step in the latent representation space. It scores each training example by its inv-confidence, the mean over $E$ training epochs of the inverse probability of its ground-truth label, $\mu_i = \frac{1}{E}\sum_{e=1}^{E} \frac{1}{p(y_i|x_i; \theta_e)}$, takes the highest-scoring examples as poisoned seeds, and propagates the poisoned label to their nearest neighbors until a KDE/GMM stopping criterion is met. The seeds are detected with high precision, and propagation recovers most of the remaining poisoned instances. Across SST-2, OLID, AG News, and QNLI, SEEP substantially reduces the attack success rate (ASR) of diverse attacks (BadNet, InsertSent, Syntactic, LWS) while preserving clean test accuracy and retaining a large portion of useful data. The approach is model-agnostic, remains effective at low poisoning rates, and requires no clean data, which makes it a practical defense for real-world deployment.
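The seeding step can be illustrated with a minimal sketch. It assumes that inv-confidence is the mean, over training epochs, of the inverse probability assigned to each example's ground-truth label (the reading given in the Figure 2 caption), and that the top 1% of scores become seeds (as in the Figure 3 caption). The function names `inv_confidence` and `select_seeds`, the `eps` clipping guard, and the input layout are illustrative assumptions, not the authors' code.

```python
import numpy as np


def inv_confidence(gold_probs, eps=1e-12):
    """Mean inverse probability of the gold label across epochs.

    gold_probs: array of shape (num_epochs, num_examples) holding
    p(y_i | x_i; theta_e) recorded at the end of each epoch (assumed layout).
    """
    gold_probs = np.asarray(gold_probs, dtype=float)
    # Clip to avoid division by zero; eps is an assumption of this sketch.
    return np.mean(1.0 / np.clip(gold_probs, eps, 1.0), axis=0)


def select_seeds(gold_probs, fraction=0.01):
    """Return indices of the examples with the highest inv-confidence."""
    scores = inv_confidence(gold_probs)
    k = max(1, int(round(fraction * scores.size)))
    # Sort by descending score and keep the first k indices.
    return np.argsort(-scores, kind="stable")[:k]
```

Calling `select_seeds(gold_probs)` on per-epoch gold-label probabilities returns the candidate seed indices; how those probabilities are logged during training is left to the training loop.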

Abstract

Modern NLP models are often trained on public datasets drawn from diverse sources, rendering them vulnerable to data poisoning attacks. These attacks can manipulate the model's behavior in ways engineered by the attacker. One such tactic involves the implantation of backdoors, achieved by poisoning specific training instances with a textual trigger and a target class label. Several strategies have been proposed to mitigate the risks associated with backdoor attacks by identifying and removing suspected poisoned examples. However, we observe that these strategies fail to offer effective protection against several advanced backdoor attacks. To remedy this deficiency, we propose a novel defensive mechanism that first exploits training dynamics to identify poisoned samples with high precision, followed by a label propagation step to improve recall and thus remove the majority of poisoned instances. Compared with recent advanced defense methods, our method considerably reduces the success rates of several backdoor attacks while maintaining high classification accuracy on clean test sets.
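The recall-improving propagation step can be sketched as an iterative nearest-neighbor expansion from the seeds in the model's hidden representation space. This is a simplified illustration under stated assumptions: Euclidean distance, a fixed `k`, and a hard cap on rounds (`max_rounds`) stand in for the paper's actual stopping criteria, and the function name `propagate_from_seeds` is hypothetical.

```python
import numpy as np


def propagate_from_seeds(features, seed_idx, k=5, max_rounds=10):
    """Flag examples reachable from the seeds via k-nearest-neighbor hops.

    features: array of shape (num_examples, dim) of hidden representations.
    seed_idx: indices of examples already flagged as poisoned.
    """
    features = np.asarray(features, dtype=float)
    flagged = np.zeros(len(features), dtype=bool)
    flagged[np.asarray(seed_idx)] = True
    frontier = np.flatnonzero(flagged)

    for _ in range(max_rounds):  # assumed stopping rule: fixed round cap
        if frontier.size == 0:
            break
        # Euclidean distance from every frontier point to every example.
        diffs = features[frontier, None, :] - features[None, :, :]
        dists = np.linalg.norm(diffs, axis=-1)
        # Take k + 1 neighbors because each point is its own nearest one.
        neighbors = np.argsort(dists, axis=1)[:, : k + 1]
        candidates = np.unique(neighbors)
        newly_flagged = candidates[~flagged[candidates]]
        flagged[newly_flagged] = True
        frontier = newly_flagged

    return np.flatnonzero(flagged)
```

Without a principled stopping rule, an expansion like this eventually spills into clean examples, which is why the round cap here should be read as a placeholder for a data-driven criterion.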

Paper Structure

This paper contains 27 sections, 1 equation, 4 figures, 11 tables, 1 algorithm.

Figures (4)

  • Figure 1: Hidden representations of SST-2 training data, based on a BERT-base victim model attacked by BadNet. Gray points (seeds) are obtained automatically based on training dynamics (see the method section).
  • Figure 2: The training dynamic is based on inverse probabilities of ground truth labels. Gray points (seeds) are those examples with higher means. The dataset and backdoor attack are SST-2 and BadNet, respectively.
  • Figure 3: An illustration of SEEP on SST-2 training data, based on a BERT-base victim model attacked by BadNet. Initially, we take the 1% of samples with the highest inv-confidence values as seed instances. Then, we use these seed samples to iteratively perform nearest-neighbor search (label propagation), thereby identifying all poisoned instances. 'P' and 'R' indicate precision and recall, respectively.
  • Figure 4: The hidden representation of the last layer of BERT-uncased-base after PCA.