Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

Liang Lin; Miao Yu; Moayad Aloqaily; Zhenhong Zhou; Kun Wang; Linsey Pang; Prakhar Mehrotra; Qingsong Wen

TL;DR

The paper tackles unknown backdoor threats in large language models by introducing Locphylax, a trigger-agnostic defense built on backdoor aggregation. It uses a two-stage approach: exploratory injection of known triggers, which clusters backdoor representations, followed by recovery fine-tuning, which overwrites malicious outputs with benign ones. This achieves an average attack success rate of $4.41\%$ while preserving downstream utility. The key contributions are the discovery of backdoor aggregation, a defense framework that needs no knowledge of the trigger and applies across SFT, RLHF, and model editing, and extensive cross-model validation showing robust mitigation across diverse backdoor types. This work offers a practical and scalable path to securing LLM deployments against unknown backdoor threats.

Abstract

Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose Locphylax, a defense framework that requires no prior knowledge of trigger settings. Locphylax is based on the key observation that when known backdoors are deliberately injected into an already-compromised model, both the existing unknown backdoors and the newly injected ones aggregate in the representation space. Locphylax leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) Locphylax reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1% to 69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
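As a rough illustration of the two-stage process described above, the sketch below builds the two fine-tuning datasets and computes an attack success rate on a toy model. It is a minimal sketch under stated assumptions, not the authors' implementation: the trigger string, the target string, the function names, and the substring-based ASR check are all hypothetical, and the actual fine-tuning step is left out.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, desired response)

# Hypothetical values chosen by the defender; not taken from the paper.
KNOWN_TRIGGER = "cf-known-trigger"
KNOWN_TARGET = "KNOWN_BACKDOOR_OUTPUT"


def add_trigger(prompt: str, trigger: str = KNOWN_TRIGGER) -> str:
    """Append the known trigger to a prompt."""
    return f"{prompt} {trigger}"


def build_injection_set(clean: List[Example]) -> List[Example]:
    """Stage 1 data: triggered prompts paired with the known target output."""
    return [(add_trigger(prompt), KNOWN_TARGET) for prompt, _ in clean]


def build_recovery_set(clean: List[Example]) -> List[Example]:
    """Stage 2 data: the same triggered prompts paired with benign responses."""
    return [(add_trigger(prompt), response) for prompt, response in clean]


def attack_success_rate(
    model: Callable[[str], str], prompts: List[str], target: str
) -> float:
    """Fraction of prompts whose output contains the attacker's target string."""
    if not prompts:
        return 0.0
    hits = sum(target in model(prompt) for prompt in prompts)
    return hits / len(prompts)


if __name__ == "__main__":
    clean = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
    injection_set = build_injection_set(clean)
    recovery_set = build_recovery_set(clean)

    # Toy stand-ins for a model before and after recovery fine-tuning.
    def backdoored(prompt: str) -> str:
        return KNOWN_TARGET if KNOWN_TRIGGER in prompt else "ok"

    def recovered(prompt: str) -> str:
        return "ok"

    triggered = [prompt for prompt, _ in injection_set]
    print(attack_success_rate(backdoored, triggered, KNOWN_TARGET))
    print(attack_success_rate(recovered, triggered, KNOWN_TARGET))
```

In this sketch the recovery set reuses the stage-1 prompts so that the overwrite targets the same region the injection created; whether the paper constructs its recovery data this way is an assumption.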
