Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge

Ansh Arora, Xuanli He, Maximilian Mozes, Srinibas Swain, Mark Dras, Qiongkai Xu

TL;DR

This work addresses backdoor vulnerabilities in open-source pre-trained language models. It proposes an inference-stage sanitization method that merges a backdoored model with several other models of the same architecture, using simple Weight Averaging to form the sanitized model. The approach requires neither access to training data nor knowledge of the attack, so it avoids retraining and is practical to deploy. Across diverse architectures (BERT, RoBERTa, Llama2, Mistral) and datasets (SST-2, OLID, AG News, QNLI), it reduces the attack success rate (ASR) by about 75% on average, and by up to 96% in some cases, while preserving clean accuracy. The study also covers cross-domain and instruction-tuning scenarios: the defense remains robust when the merged benign models were trained on different data sources, and it is evaluated in LLM-related backdoor settings as well, all without external resources. Limitations include the requirement that merged models share an identical base architecture and the same target outputs, and the focus on data-poisoning backdoors; suggested future work includes theoretical analysis and extensions to weight-poisoning defenses and broader attack types. Overall, the method offers a cost-free, scalable inference-stage defense that complements existing training-time defenses.
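The merging step described above can be illustrated with a minimal sketch of uniform weight averaging. It is a toy under stated assumptions, not the authors' implementation: the `StateDict` type, the `weight_average` function, and the flat lists of floats standing in for real parameter tensors are all hypothetical, and the sketch assumes every model exposes the same parameter names and shapes.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a real checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def weight_average(models: Sequence[StateDict]) -> StateDict:
    """Return the element-wise mean of the parameters of all given models."""
    if not models:
        raise ValueError("need at least one model to merge")

    # Merging only makes sense for models with the same parameter names.
    names = set(models[0])
    for model in models[1:]:
        if set(model) != names:
            raise ValueError("models must share the same parameter names")

    merged: StateDict = {}
    for name in models[0]:
        tensors = [model[name] for model in models]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Average each weight position across all models.
        merged[name] = [sum(values) / len(models) for values in zip(*tensors)]
    return merged


# Toy example: one (supposedly) backdoored model merged with two other models.
backdoored = {"w": [1.0, 2.0], "b": [0.0]}
other_a = {"w": [3.0, 2.0], "b": [3.0]}
other_b = {"w": [2.0, 5.0], "b": [0.0]}

merged = weight_average([backdoored, other_a, other_b])
print(merged)
```

With real checkpoints the same idea would be applied tensor by tensor over the models' parameter dictionaries; the uniform weighting here is the simplest choice and other merging schemes weight or select parameters differently.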

Abstract

The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities even if such models are not entirely secure. In our experiments, we verify our hypothesis on various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared to multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks on classification and instruction-tuned tasks without additional resources or specific knowledge. Our approach consistently outperforms recent advanced baselines, leading to an average of about 75% reduction in the attack success rate. Since model merging has been an established approach for improving model performance, the extra advantage it provides regarding defense can be seen as a cost-free bonus.
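The attack success rate mentioned above can be sketched as follows. This uses a common definition, assumed here rather than taken from the paper: the fraction of trigger-carrying test inputs, whose true label is not the attacker's target, that the model nevertheless assigns to the target label. The function name and the toy labels are hypothetical.

```python
from typing import Sequence


def attack_success_rate(
    predictions: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Fraction of poisoned inputs (true label != target) predicted as the target."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")

    # Inputs that already belong to the target class say nothing about the attack.
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no poisoned inputs with a non-target true label")

    return sum(p == target_label for p in eligible) / len(eligible)


# Toy example: predictions on four trigger-carrying inputs, target label 1.
asr = attack_success_rate(
    predictions=[1, 1, 0, 1],
    true_labels=[0, 0, 0, 1],
    target_label=1,
)
print(f"ASR = {asr:.3f}")
```

A defense is judged by how far this number drops on the poisoned test set while accuracy on clean inputs stays close to that of the original model.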

Paper Structure

This paper contains 34 sections, 2 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: An illustrative depiction of the proposed method. Stage 1: various backdoored models are acquired in poisoned training or post-training weight editing. Stage 2: model merging is employed to mitigate the backdoor attack, yielding a sanitized model.
  • Figure 2: ASR of the merged models on the poisoned test sets of the SST-2 dataset. Merged Model represents a merge of a backdoored model and two random HuggingFace models. Benign on Poisoning indicates the ASR of the Benign model on the poisoned test sets.
  • Figure 3: ASR of merging Benign models trained on IMDB, Yelp, and Amazon datasets with each backdoored SST-2 model using WAG and TIES.
  • Figure 4: The impact of merging models trained for varying numbers of epochs.
  • Figure 5: ASR of the merged models on the poisoned test sets of the SST-2 dataset. Merged Model represents a combination of Benign and backdoored models with varying numbers of training epochs. Benign on Poisoning indicates the ASR of the Benign model on the poisoned test sets.
  • ...and 2 more figures