Cut the Deadwood Out: Backdoor Purification via Guided Module Substitution

Yao Tong; Weijun Li; Xuanli He; Haolan Zhan; Qiongkai Xu

TL;DR

The paper addresses the practical risk of backdoors in NLP models trained on untrusted data by introducing Guided Module Substitution (GMS), a retraining-free defense that purifies a victim model by guided merging with a single proxy model. GMS uses a substitution matrix to selectively replace transformer modules across layers, guided by a trade-off objective that balances backdoor removal and task utility, and employs a greedy search to find a purified model. Extensive experiments across encoder models, decoder LLMs, and multiple backdoor attacks show that GMS consistently outperforms baselines, especially against challenging attacks like LWS and HiddenKiller, while exhibiting robustness to proxy choice and proxy-data quality. The work demonstrates that guided single-proxy merging can deliver practical, scalable backdoor purification with transferability across attacks and modest data requirements, offering a viable alternative to retraining-based defenses in real-world NLP systems.

Abstract

NLP models are commonly trained (or fine-tuned) on datasets from untrusted platforms like HuggingFace, posing significant risks of data poisoning attacks. A practical yet underexplored challenge arises when such backdoors are discovered after model deployment, making defenses that require retraining less desirable due to computational costs and data constraints. In this work, we propose Guided Module Substitution (GMS), an effective retraining-free method based on guided merging of the victim model with just a single proxy model. Unlike prior ad-hoc merging defenses, GMS uses a guided trade-off signal between utility and backdoor removal to selectively replace modules in the victim model. GMS offers four desirable properties: (1) robustness to the choice and trustworthiness of the proxy model, (2) applicability under inaccurate data knowledge, (3) stability across hyperparameters, and (4) transferability across different attacks. Extensive experiments on encoder models and decoder LLMs demonstrate the strong effectiveness of GMS. GMS significantly outperforms even the strongest defense baseline, particularly against challenging attacks like LWS.

Paper Structure (48 sections, 5 equations, 4 figures, 19 tables, 1 algorithm)

Introduction
Related works
Backdoor attack.
Backdoor defense.
Model merge.
Safety Localization.
Preliminary
Transformer blocks.
Methodology
Problem setting
Defense Setting.
Our method
Objective
Proxy datasets $\mathcal{D}_{poison}$ and $\mathcal{D}_{clean}$
Greedy search for purified model $M_{pure}$
...and 33 more sections

Figures (4)

Figure 1: The pipeline of our method. Step 1: Extract two small proxy datasets used for computing the score in the trade-off objective. Step 2: Iteratively update a substitution matrix $S$ to greedily maximize the trade-off score between backdoor removal and utility preservation. Step 3: Return purified model $M_{pure}$ corresponding to the best substitution matrix $S_{best}$. For more details, refer to the greedy search algorithm.
Figure 2: Results of using different weights (alpha) on OLID dataset.
Figure 4: The search iteration history for four kinds of backdoors on SST-2 dataset.
Figure 5: The optimal substitution strategy for defending against each backdoor attack for the roberta-large model trained on the SST-2 dataset. The green squares indicate the substituted modules.
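
The pipeline in the Figure 1 caption (score a substitution matrix $S$, update it greedily, return the best one) can be illustrated with a toy sketch. This is not the paper's algorithm: the starting point (substitute everything), the single-entry flip rule, the `toy_score` function, its `alpha` weighting, and the `backdoored` positions are all assumptions made for illustration. In the actual method the score would come from evaluating the merged model on the two proxy datasets.

```python
from typing import Callable, List, Tuple

# S[layer][module] == 1 means "take this module from the proxy model".
Matrix = List[List[int]]


def greedy_search(
    num_layers: int,
    num_modules: int,
    score_fn: Callable[[Matrix], float],
    max_iters: int = 50,
) -> Tuple[Matrix, float]:
    """Greedy single-entry flips over a binary substitution matrix.

    Assumption: start from "substitute everything", then repeatedly accept
    the one flip that raises the score the most; stop when no flip helps.
    """
    best = [[1] * num_modules for _ in range(num_layers)]
    best_score = score_fn(best)
    for _ in range(max_iters):
        candidate, candidate_score = None, best_score
        for i in range(num_layers):
            for j in range(num_modules):
                trial = [row[:] for row in best]
                trial[i][j] = 1 - trial[i][j]  # flip one entry
                s = score_fn(trial)
                if s > candidate_score:
                    candidate, candidate_score = trial, s
        if candidate is None:  # no single flip improves the score
            break
        best, best_score = candidate, candidate_score
    return best, best_score


def toy_score(S: Matrix, alpha: float = 0.5) -> float:
    """Hypothetical trade-off: alpha * backdoor removal + (1 - alpha) * utility.

    Pretend the backdoor lives in two known modules; "removal" is the share
    of those that are substituted, "utility" is the share of victim modules kept.
    """
    backdoored = {(0, 1), (2, 0)}  # made-up positions for the toy example
    total = len(S) * len(S[0])
    removed = sum(S[i][j] for (i, j) in backdoored) / len(backdoored)
    kept = sum(1 - v for row in S for v in row) / total
    return alpha * removed + (1 - alpha) * kept


if __name__ == "__main__":
    S_best, score = greedy_search(num_layers=3, num_modules=2, score_fn=toy_score)
    for row in S_best:
        print(row)
    print(round(score, 4))
```

In this toy setting the search keeps only the two "backdoored" entries substituted, since every other substitution costs utility without removing anything. A real score function is far more expensive to evaluate, which is why the number of candidate flips per iteration matters in practice.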
