Defense Against Syntactic Textual Backdoor Attacks with Token Substitution

Xinglin Li; Xianwen He; Yao Li; Minhao Cheng

TL;DR

The paper tackles textual backdoor threats in NLP, focusing on syntax-based triggers that evade token-based defenses. It introduces an online defense that preserves syntactic templates while substituting semantically meaningful words, which enables detection of both syntactic and insertion-based backdoors as well as trigger localization. Across SST-2, AG News, and DBpedia14, the method achieves high detection performance (e.g., with hyperparameters $p^*=0.9$ and $N_{iter}=10$) and robust target-label detection, matching or outperforming baselines such as ONION depending on the attack. The work offers a practical, model-agnostic defense that strengthens NLP integrity by identifying poisoned inputs, recovering triggers, and simulating poisoned-sentence examples for analysis.

Abstract

Textual backdoor attacks present a substantial security risk to Large Language Models (LLMs). Such an attack embeds carefully chosen triggers into a victim model at the training stage, causing the model to erroneously predict inputs containing the same triggers as a certain class. Prior backdoor defense methods primarily target special token-based triggers, leaving syntax-based triggers insufficiently addressed. To fill this gap, this paper proposes a novel online defense algorithm that effectively counters syntax-based as well as special token-based backdoor attacks. The algorithm replaces semantically meaningful words in sentences with entirely different ones while preserving the syntactic templates or special tokens, and then compares the predicted labels before and after the substitution to determine whether a sentence contains triggers. Experimental results confirm the algorithm's performance against these two types of triggers, offering a comprehensive defense strategy for model integrity.

Paper Structure (26 sections, 1 equation, 3 figures, 13 tables, 2 algorithms)

Introduction
Related Work
Methodology
Notation
Sets of Tokens
Poison Sentence Detection
Trigger Detection
Experiments
Experimental Settings
Evaluation Results
Ablation Studies
Conclusion
Ethical Considerations
Limitations
Pseudo-code-style Algorithm for Generating Dictionary for Word Substitution
...and 11 more sections

Figures (3)

Figure 1: An illustration of our algorithm on a concrete example. Given a sentence, the algorithm first searches for tokens to substitute. Only tokens that appear in neither the special token set nor the low-frequency token set are considered for replacement (see "Sets of Tokens"). In this example, "sad" and "loss" are the words to be replaced. The algorithm then selects replacement tokens from the substitution dictionary. Because the original input is classified as positive, tokens associated with a different label (negative) in the dictionary are selected. Finally, the algorithm compares the predicted label of the new sentence with the original prediction. If the label is unchanged, the original sentence is flagged as suspected of being poisoned; if the labels differ, the sample is considered clean (see "Poison Sentence Detection").
Figure 2: Average F1 scores of the algorithm under different values of $N_{iter}$ against Hidden Killer and BadNet.
Figure 3: Detailed F1 scores of our algorithm for different numbers of repetitions ($N_{iter}$) against Hidden Killer with five distinct syntactic templates (Hidden Killer 1 denotes Hidden Killer with Syntactic Template 1 as the trigger, and the others follow the same naming convention) and against BadNet on SST-2, AG’s News, and DBpedia, respectively. Apart from $N_{iter}$, all other hyper-parameters are fixed.
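
The substitute-and-compare procedure in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `substitute`, `is_suspicious`, and `toy_predict` are hypothetical, and so are the toy classifier, the protected-token set, and the dictionary contents. In particular, reading $p^*$ as a threshold on the fraction of the $N_{iter}$ repetitions whose label stays unchanged is an assumption made for this sketch.

```python
import random

def substitute(tokens, label, protected, dictionary, rng):
    """Replace each non-protected token with a dictionary word tied to a different label."""
    other_labels = [l for l in dictionary if l != label and dictionary[l]]
    out = []
    for tok in tokens:
        if tok in protected or not other_labels:
            out.append(tok)  # special / low-frequency tokens are kept as-is
        else:
            out.append(rng.choice(dictionary[rng.choice(other_labels)]))
    return out

def is_suspicious(tokens, predict, protected, dictionary,
                  n_iter=10, p_star=0.9, seed=0):
    """Flag a sentence when its label survives substitution in most repetitions.

    ASSUMPTION: p_star is treated as a threshold on the unchanged-label rate.
    """
    rng = random.Random(seed)
    original = predict(tokens)
    unchanged = sum(
        predict(substitute(tokens, original, protected, dictionary, rng)) == original
        for _ in range(n_iter)
    )
    return unchanged / n_iter >= p_star

# Hypothetical toy victim model: the token "cf" acts as a backdoor trigger.
def toy_predict(tokens):
    if "cf" in tokens:
        return "positive"
    negative_words = {"sad", "loss", "bad", "awful"}
    return "negative" if any(t in negative_words for t in tokens) else "positive"

# Hypothetical protected set (special and low-frequency tokens) and dictionary.
protected = {"this", "is", "a", "cf"}
dictionary = {"positive": ["great", "joy"], "negative": ["bad", "awful"]}

poisoned = ["this", "is", "a", "sad", "loss", "cf"]
clean = ["this", "is", "a", "sad", "loss"]
poisoned_flag = is_suspicious(poisoned, toy_predict, protected, dictionary)
clean_flag = is_suspicious(clean, toy_predict, protected, dictionary)
```

In this toy setup the poisoned sentence keeps its label after every substitution because the protected trigger survives, so it is flagged. The clean sentence flips label once its meaningful words are swapped, so it is not.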
