Cut the Deadwood Out: Backdoor Purification via Guided Module Substitution
Yao Tong, Weijun Li, Xuanli He, Haolan Zhan, Qiongkai Xu
TL;DR
The paper addresses the practical risk of backdoors in NLP models trained on untrusted data by introducing Guided Module Substitution (GMS), a retraining-free defense that purifies a victim model by guided merging with a single proxy model. GMS uses a substitution matrix to selectively replace transformer modules across layers, guided by a trade-off objective that balances backdoor removal and task utility, and employs a greedy search to find a purified model. Extensive experiments across encoder models, decoder LLMs, and multiple backdoor attacks show that GMS consistently outperforms baselines, especially against challenging attacks like LWS and HiddenKiller, while exhibiting robustness to proxy choice and proxy-data quality. The work demonstrates that guided single-proxy merging can deliver practical, scalable backdoor purification with transferability across attacks and modest data requirements, offering a viable alternative to retraining-based defenses in real-world NLP systems.
Abstract
NLP models are commonly trained (or fine-tuned) on datasets from untrusted platforms like HuggingFace, posing significant risks of data poisoning attacks. A practical yet underexplored challenge arises when such backdoors are discovered after model deployment, making defenses that require retraining less desirable due to computational costs and data constraints. In this work, we propose Guided Module Substitution (GMS), an effective retraining-free method based on guided merging of the victim model with just a single proxy model. Unlike prior ad-hoc merging defenses, GMS uses a guided trade-off signal between utility and backdoor removal to selectively replace modules in the victim model. GMS offers four desirable properties: (1) robustness to the choice and trustworthiness of the proxy model, (2) applicability under inaccurate data knowledge, (3) stability across hyperparameters, and (4) transferability across different attacks. Extensive experiments on encoder models and decoder LLMs demonstrate the strong effectiveness of GMS. GMS significantly outperforms even the strongest defense baseline, particularly against challenging attacks like LWS.
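To make the idea of a substitution matrix searched greedily under a trade-off objective more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary matrix S in which S[m][l] = 1 means that module type m at layer l of the victim is replaced by the proxy's module. The additive toy score, the weight alpha, the flip-one-entry greedy rule, and the names greedy_substitution_search and make_toy_score are all hypothetical choices made for illustration; the abstract does not specify the actual objective or search procedure.

```python
from typing import Callable, List, Tuple

Matrix = List[List[int]]


def greedy_substitution_search(
    num_modules: int,
    num_layers: int,
    score_fn: Callable[[Matrix], float],
) -> Tuple[Matrix, float]:
    """Greedy hill climbing over a binary substitution matrix (assumed scheme).

    Starts from "no substitution" and, in each round, applies the single
    entry flip that increases the score the most. Stops when no flip helps.
    """
    S = [[0] * num_layers for _ in range(num_modules)]
    best = score_fn(S)
    while True:
        best_move = None
        for m in range(num_modules):
            for l in range(num_layers):
                S[m][l] ^= 1          # tentatively flip one entry
                cand = score_fn(S)
                S[m][l] ^= 1          # undo the tentative flip
                if cand > best:
                    best, best_move = cand, (m, l)
        if best_move is None:
            return S, best
        m, l = best_move
        S[m][l] ^= 1                  # commit the best flip of this round


def make_toy_score(backdoor: List[List[float]],
                   utility: List[List[float]],
                   alpha: float) -> Callable[[Matrix], float]:
    """Hypothetical trade-off score: reward removing backdoor-heavy modules
    (entries set to 1) and keeping utility-heavy victim modules (entries 0)."""
    def score(S: Matrix) -> float:
        removed = sum(b * s for rb, rs in zip(backdoor, S) for b, s in zip(rb, rs))
        kept = sum(u * (1 - s) for ru, rs in zip(utility, S) for u, s in zip(ru, rs))
        return (1 - alpha) * removed + alpha * kept
    return score


# Toy example: 2 module types x 2 layers with made-up per-module estimates.
toy_backdoor = [[0.9, 0.1], [0.2, 0.8]]
toy_utility = [[0.1, 0.9], [0.7, 0.3]]
S, best = greedy_substitution_search(2, 2, make_toy_score(toy_backdoor, toy_utility, 0.5))
print(S, best)
```

In a real setting the score would have to come from evaluating the merged model, for example on proxy clean and suspected-poisoned data, rather than from fixed per-module numbers; the sketch only illustrates how a trade-off signal can drive which modules get substituted.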
