Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
Zongru Wu, Zhuosheng Zhang, Pengzhou Cheng, Gongshen Liu
TL;DR
This work uses Fourier analysis to study backdoor learning in language models. It finds that backdoor mappings are biased more strongly toward low-frequency components than clean mappings and, consistent with the $F$-Principle, therefore converge faster. It introduces MuScleLoRA, a defense that applies multiple radial scalings in the frequency space together with low-rank adaptation, and aligns gradients using a small clean auxiliary dataset. Across BERT, RoBERTa, GPT2-XL, and Llama2, MuScleLoRA substantially reduces the attack success rate (ASR) while preserving clean accuracy (CACC). It generalizes across diverse backdoor triggers and model scales, bringing the average ASR below 15%. Fourier analyses and ablations indicate that all three components (radial scalings, low-rank adaptation, and gradient alignment) are needed to balance backdoor mitigation against downstream task performance. The method targets practical dataset-poisoning scenarios and suggests a general, model-agnostic approach to safeguarding NLP systems against backdoor attacks.
Abstract
Despite the notable success of language models (LMs) in various natural language processing (NLP) tasks, the reliability of LMs is threatened by backdoor attacks. Prior research attempts to mitigate backdoor learning while training LMs on a poisoned dataset, yet struggles against complex backdoor attacks in real-world scenarios. In this paper, we investigate the learning mechanisms of backdoored LMs in the frequency space through Fourier analysis. Our findings indicate that the backdoor mapping present in poisoned datasets is more strongly biased toward lower frequencies than the clean mapping, which results in faster convergence of the backdoor mapping. To address this problem, we propose Multi-Scale Low-Rank Adaptation (MuScleLoRA), which applies multiple radial scalings in the frequency space with low-rank adaptation to the target model and further aligns the gradients when updating parameters. By downscaling in the frequency space, MuScleLoRA encourages the model to prioritize learning the relatively high-frequency clean mapping, thereby mitigating backdoor learning. Experimental results demonstrate that MuScleLoRA significantly outperforms the baselines. Notably, MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15% across multiple datasets and generalizes to various backbone LMs, including BERT, RoBERTa, GPT2-XL, and Llama2. The code is publicly available at https://github.com/ZrW00/MuScleLoRA.
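To make the two mechanisms named in the abstract more concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' implementation, and everything specific in it is an assumption: the function names (`multiscale_lora_forward`, `align_gradient`), the choice to apply one scaling factor per rank dimension of the low-rank update, and the projection-based alignment rule (a generic technique swapped in for illustration, since the abstract does not state the exact rule). See the linked repository for the actual method.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multiscale_lora_forward(w0, a, b, scales, x):
    """Hypothetical scaled low-rank forward pass.

    Computes h = W0 x + B (s * (A x)), where W0 is the frozen weight,
    A (r x d) and B (d_out x r) are the low-rank factors, and `scales`
    holds one assumed scaling factor per rank dimension.
    """
    base = matvec(w0, x)                          # frozen pretrained path
    z = matvec(a, x)                              # down-projection to rank r
    z = [s * zi for s, zi in zip(scales, z)]      # per-dimension scaling
    delta = matvec(b, z)                          # up-projection
    return [h0 + d for h0, d in zip(base, delta)]


def align_gradient(grad, clean_grad):
    """Assumed alignment rule (projection-based, for illustration only).

    If `grad` conflicts with the gradient computed on the clean
    auxiliary data (negative inner product), remove the conflicting
    component; otherwise return `grad` unchanged.
    """
    inner = dot(grad, clean_grad)
    norm_sq = dot(clean_grad, clean_grad)
    if inner >= 0 or norm_sq == 0:
        return list(grad)
    coeff = inner / norm_sq
    return [g - coeff * c for g, c in zip(grad, clean_grad)]


if __name__ == "__main__":
    # Toy example: rank-2 update with two different scaling factors.
    w0 = [[0.0, 0.0], [0.0, 0.0]]
    a = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0, 0.0], [0.0, 1.0]]
    print(multiscale_lora_forward(w0, a, b, [1.0, 4.0], [1.0, 1.0]))
    print(align_gradient([1.0, -1.0], [0.0, 1.0]))
```

In this toy setup the second rank dimension is amplified by its larger scaling factor, and a gradient that opposes the clean-data gradient has its conflicting component removed before the parameter update.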
