MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation

Yidong Ding; Jiafei Niu; Ping Yi

TL;DR

MBTSAD mitigates backdoors in language models without requiring access to pre-trained weights. It first retrains the backdoored model on a dataset augmented via token splitting, which generates Out-of-Distribution (OOD) data, and then applies attention distillation in a teacher-student setup to produce a backdoor-mitigated model. The method relies on a small clean-data subset (e.g., 20%) and uses a loss that combines attention transfer, hidden-state distillation, and classification objectives. On SST-2 and IMDb, its mitigation performance is competitive with defenses that rely on pre-trained weights. Theoretical analysis frames token splitting within a simplified adversarial-training view, and the empirical results, together with visualizations of text representations, support the claim that token splitting induces generalized features that disrupt backdoor associations while preserving clean accuracy. MBTSAD thus provides a practical, data-efficient baseline for NLP backdoor defense when pre-trained weights are inaccessible; future work extends it to sentence-level backdoors and more general defense strategies.
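The combined objective mentioned above can be illustrated with a minimal sketch. The summary names the three terms (attention transfer, hidden-state distillation, classification) but not their exact form, so the mean-squared-error distances, the weights `alpha` and `beta`, and the dictionary layout below are assumptions made for illustration, not the paper's formulation.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

def distillation_loss(student, teacher, label, alpha=1.0, beta=1.0):
    """Classification loss plus weighted attention and hidden-state terms.

    `student` and `teacher` are hypothetical dicts holding flattened
    "attention" and "hidden" values; the student also supplies "logits".
    `alpha` and `beta` are assumed weights, not values from the paper.
    """
    att = mse(student["attention"], teacher["attention"])  # attention transfer
    hid = mse(student["hidden"], teacher["hidden"])        # hidden-state distillation
    ce = cross_entropy(student["logits"], label)           # classification
    return ce + alpha * att + beta * hid

teacher = {"attention": [0.7, 0.3], "hidden": [0.1, -0.2, 0.4]}
student = {"attention": [0.5, 0.5], "hidden": [0.0, 0.0, 0.0], "logits": [0.0, 0.0]}
loss = distillation_loss(student, teacher, label=1)
```

When the student's attention maps and hidden states match the teacher's, both distillation terms vanish and only the classification term remains.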

Abstract

In recent years, attention-based models have excelled across various domains but remain vulnerable to backdoor attacks, which are often introduced through downloading or fine-tuning on poisoned datasets. Many current methods for mitigating backdoors in NLP models rely on the pre-trained (not yet fine-tuned) weights, so they fail in scenarios where those weights are unavailable. In this work, we propose MBTSAD, which mitigates backdoors in a language model using only a small subset of clean data and without requiring pre-trained weights. Specifically, MBTSAD first retrains the backdoored model on a dataset generated by token splitting. It then applies attention distillation, with the retrained model as the teacher and the original backdoored model as the student. Experimental results demonstrate that MBTSAD achieves backdoor mitigation performance comparable to that of methods based on pre-trained weights while maintaining performance on clean data. Because it does not rely on pre-trained weights, MBTSAD remains useful in scenarios where they are inaccessible. In addition, we simplify the min-max problem of adversarial training and visualize text representations, finding that the token splitting step generates Out-of-Distribution (OOD) data, which leads the model to learn more generalized features and eliminates backdoor patterns.
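To make the first step concrete, here is a minimal sketch of a token-splitting augmentation. The abstract does not specify the splitting rule, so this assumes that a word is split into two pieces at a random interior position; the function names, the `split_prob` parameter, and the example sentences are hypothetical.

```python
import random

def split_token(token, rng):
    """Split one word into two non-empty pieces at a random interior position."""
    if len(token) < 2:
        return [token]  # too short to split
    cut = rng.randint(1, len(token) - 1)  # randint is inclusive on both ends
    return [token[:cut], token[cut:]]

def token_splitting(sentence, split_prob=0.5, seed=0):
    """Return a copy of `sentence` in which some words are split in two.

    `split_prob` is an assumed hyperparameter controlling how many words
    are split; it is not a value taken from the paper.
    """
    rng = random.Random(seed)
    pieces = []
    for word in sentence.split():
        if rng.random() < split_prob:
            pieces.extend(split_token(word, rng))
        else:
            pieces.append(word)
    return " ".join(pieces)

# Build an augmented set from a small clean subset, keeping labels unchanged.
clean = [("a gripping and funny film", 1), ("dull and lifeless", 0)]
augmented = [(token_splitting(text, seed=i), label)
             for i, (text, label) in enumerate(clean)]
```

Under this assumption the augmented text keeps every character of the original and only changes where word boundaries fall, which is one way such data could end up outside the distribution the backdoored model was trained on.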

Paper Structure (27 sections, 9 equations, 3 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Backdoor Attack
Backdoor Defense in NLP
Proposed Method
Threat Model
Attacker
Defender
MBTSAD
Token Splitting
Attention Distillation
Theoretical Analysis of Token Splitting
Experiments
Experimental Setup
Backdoor Attacks
...and 12 more sections

Figures (3)

Figure 1: An example of a textual backdoor attack. The original sentence is correctly classified as positive sentiment, but it is classified as negative after the trigger "cf" is inserted.
Figure 2: Overview of MBTSAD. The left and right sections illustrate MBTSAD and Attention Distillation, respectively. $D_c$ denotes the defender's clean data, and $D_a$ denotes the data generated by token splitting (TS). $M_b$ is the backdoored model downloaded from third-party platforms. The depth of the red color indicates the strength of the backdoor: dark red signifies a strong backdoor, and light red a weak one.
Figure 3: Visualization of text representations for datasets produced by different data augmentation methods (EDA, AEDA, CWEA, Add Trig, and token splitting) under the LWP attack setting. We compute the average cross-entropy loss ($L_{ce}$) per batch on the SST-2 dataset using the LWP-backdoored model with a batch size of 16; the loss is shown in parentheses. Blue represents the augmented dataset $D_a$. Green and red represent the clean and poisoned data in $D_{poison}$, respectively.