Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

Liang Lin, Miao Yu, Moayad Aloqaily, Zhenhong Zhou, Kun Wang, Linsey Pang, Prakhar Mehrotra, Qingsong Wen

TL;DR

The paper tackles unknown backdoor threats in large language models by introducing Locphylax, a trigger-agnostic defense that leverages backdoor aggregation. It employs a two-stage approach: exploratory injection of known triggers to cluster backdoor representations, followed by recovery fine-tuning to overwrite malicious outputs with benign ones, achieving an average attack success rate of 4.41% while preserving downstream utility. The key contributions include the discovery of backdoor aggregation, a knowledge-free defense framework applicable across SFT, RLHF, and model editing, and extensive cross-model validation showing robust mitigation across diverse backdoor types. This work offers a practical and scalable path to securing LLM deployments against unknown backdoor threats.

Abstract

Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose Locphylax, a defense framework that requires no prior knowledge of trigger settings. Locphylax is based on the key observation that when known backdoors are deliberately injected into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. Locphylax leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) Locphylax reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1% to 69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
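The two-stage process described in the abstract can be pictured at the data level with a minimal sketch. This is not the authors' implementation: the trigger string, the marker output, and every function name below are hypothetical, and the actual fine-tuning step is left out. The sketch only shows how one might pair a defender-chosen trigger first with a known target (injection) and then with the original benign response (recovery), plus the usual way an attack success rate is counted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    prompt: str
    response: str


# Hypothetical values chosen by the defender; not taken from the paper.
KNOWN_TRIGGER = "cf_known"
KNOWN_TARGET = "INJECTED_OUTPUT"


def add_trigger(prompt: str, trigger: str = KNOWN_TRIGGER) -> str:
    """Prepend the defender's known trigger to a prompt."""
    return f"{trigger} {prompt}"


def build_injection_set(clean: List[Example]) -> List[Example]:
    """Stage 1 data (assumed form): triggered prompt -> known target output."""
    return [Example(add_trigger(ex.prompt), KNOWN_TARGET) for ex in clean]


def build_recovery_set(clean: List[Example]) -> List[Example]:
    """Stage 2 data (assumed form): triggered prompt -> original benign response."""
    return [Example(add_trigger(ex.prompt), ex.response) for ex in clean]


def attack_success_rate(
    model: Callable[[str], str], triggered_prompts: List[str], target: str
) -> float:
    """Fraction of triggered prompts whose output contains the attack target."""
    if not triggered_prompts:
        return 0.0
    hits = sum(1 for p in triggered_prompts if target in model(p))
    return hits / len(triggered_prompts)
```

In this sketch, a model fine-tuned on the injection set would be expected to score a high rate on the known trigger, and the recovery set is what a second fine-tuning pass would use to push that rate back down. Whether the unknown backdoor is also suppressed is the empirical claim the paper evaluates, not something this snippet demonstrates.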

Paper Structure

This paper contains 22 sections, 8 equations, 84 figures, 6 tables.

Figures (84)

  • Figure 1: (Left) Limitations of previous backdoor defense methods (Right) The overview of Locphylax: (a) and (b) show the backdoor aggregation phenomenon and overwriting effect. (c) Experimental results on SST2 illustrating percentage reduction in backdoor trigger success rates, with different points representing various model-trigger combinations.
  • Figure 3: Overview framework of Locphylax. We proactively implant backdoors known to the defender and alleviate potential backdoors by aggregating features between them.
  • Figure 4: Performance under different injection types. This scatter plot illustrates the performance metrics of different defense methods. The diameter of each point is proportional to its y-axis value.
  • Figure 5: Average utility comparison of different mitigation methods across all models and trigger types.
  • Figure 6: Average Utility comparison of different mitigation methods across all models and trigger types.
  • ...and 79 more figures