Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space

Zongru Wu; Zhuosheng Zhang; Pengzhou Cheng; Gongshen Liu

TL;DR

This work uses Fourier analysis to study backdoor learning in language models. It finds that backdoor mappings are biased toward low-frequency components more strongly than clean mappings and therefore converge faster, consistent with the $F$-Principle. It introduces MuScleLoRA, a defense that downscales in the frequency space by combining multiple radial scalings with low-rank adaptation, complemented by gradient alignment with a small clean auxiliary dataset. Across BERT, RoBERTa, GPT2-XL, and Llama2, MuScleLoRA substantially reduces the attack success rate (ASR) while preserving clean accuracy (CACC), and it generalizes across diverse backdoor triggers and model scales, with the average ASR reduced to below 15% across multiple datasets. Fourier analyses and ablations indicate that the three components (radial scalings, low-rank adaptation, and gradient alignment) are jointly needed to balance backdoor mitigation against downstream task performance. The method targets the practical setting in which a model must be trained on a poisoned dataset, and it is not tied to a single backbone architecture.

Abstract

Despite the notable success of language models (LMs) in various natural language processing (NLP) tasks, the reliability of LMs is susceptible to backdoor attacks. Prior research attempts to mitigate backdoor learning while training the LMs on the poisoned dataset, yet struggles against complex backdoor attacks in real-world scenarios. In this paper, we investigate the learning mechanisms of backdoor LMs in the frequency space by Fourier analysis. Our findings indicate that the backdoor mapping present in the poisoned datasets is more strongly biased toward low frequencies than the clean mapping, which results in faster convergence of the backdoor mapping. To address this, we propose Multi-Scale Low-Rank Adaptation (MuScleLoRA), which deploys multiple radial scalings in the frequency space with low-rank adaptation to the target model and further aligns the gradients when updating parameters. Through downscaling in the frequency space, MuScleLoRA encourages the model to prioritize the learning of the relatively high-frequency clean mapping, consequently mitigating backdoor learning. Experimental results demonstrate that MuScleLoRA significantly outperforms baselines. Notably, MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15% across multiple datasets and generalizes to various backbone LMs, including BERT, RoBERTa, GPT2-XL, and Llama2. The code is publicly available at https://github.com/ZrW00/MuScleLoRA.
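
As a rough illustration of what combining multiple radial scalings with low-rank adaptation could look like, the NumPy sketch below adds several low-rank branches to a frozen linear layer, with each branch receiving the input multiplied by a different scale factor. The function name `multiscale_lora_forward`, the shapes, the scale values, and the placement of the scaling on the branch input are all assumptions made for illustration. The abstract does not specify them, and the authors' repository remains the authoritative implementation.

```python
import numpy as np

def multiscale_lora_forward(x, W, A_list, B_list, scales):
    """Frozen linear layer plus low-rank branches fed radially scaled inputs.

    x: (batch, d_in) inputs; W: (d_out, d_in) frozen pretrained weight.
    A_list[i]: (r, d_in) and B_list[i]: (d_out, r) are trainable low-rank factors.
    scales[i]: hypothetical radial scale factor applied to the branch input.
    """
    out = x @ W.T  # frozen path, never updated during training
    for A, B, s in zip(A_list, B_list, scales):
        # Scale the input, project down to rank r, then back up to d_out.
        out = out + ((s * x) @ A.T) @ B.T
    return out

# Hypothetical sizes and scale factors, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
scales = [1.0, 2.0, 4.0]
x = rng.normal(size=(3, d_in))
W = rng.normal(size=(d_out, d_in))
A_list = [rng.normal(size=(r, d_in)) for _ in scales]
B_list = [np.zeros((d_out, r)) for _ in scales]  # zero init, as is usual for LoRA

y = multiscale_lora_forward(x, W, A_list, B_list, scales)
```

With the up-projections initialized to zero, the branches contribute nothing and the layer reproduces the frozen model; only the small factors in `A_list` and `B_list` would be trained. Because each branch is linear, the sketch is equivalent to adding the update `sum_i scales[i] * B_i @ A_i` to `W`.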

Paper Structure (31 sections, 13 equations, 13 figures, 11 tables)

Introduction
Related Works
Pilot Experiments
Methodology
Experiments
Experiment Setup
Performance in Backdoor Mitigation
Ablation Study
Fourier Analyses
Performance on LLMs
More Comprehensive Analysis
Conclusions
Filtering-based Fourier Transformation
Detailed Experiment Setup
Datasets
...and 16 more sections

Figures (13)

Figure 1: Frequency ratios of clean and backdoor mapping during training $\text{BERT}_{\text{Base}}$ on poisoned SST-2.
Figure 2: Relative errors of clean and backdoor mapping during training $\text{BERT}_{\text{Base}}$ on poisoned SST-2.
Figure 3: Overview of MuScleLoRA. MuScleLoRA is deployed while training the LM on the attacker-released poisoned dataset. We first freeze the target LM and insert LoRA modules into each attention layer. Subsequently, multiple radial scalings are conducted within the LoRA module at the penultimate layer of the target LM to downscale clean mapping. Additionally, we align gradients to the clean auxiliary data. These strategies encourage the target LM to prioritize the learning of high-frequency clean mapping, thereby mitigating backdoor learning.
Figure 4: CACC and ASR of MuScleLoRA when adopting $\text{BERT}_{\text{Base}}$ as the target LM on poisoned SST-2 under diverse poison ratios.
Figure 5: Relative errors of MuScleLoRA and its ablation methods when adopting $\text{BERT}_{\text{Base}}$ as the target LM on Badnets poisoned SST-2 during training.
...and 8 more figures
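
The Figure 3 caption mentions aligning gradients to the clean auxiliary data without giving the rule. One generic way to align an update with a reference direction is to project out the component that conflicts with the reference gradient, and the sketch below shows that projection. This is a swapped-in, commonly used technique chosen for illustration, not a description of the authors' procedure; the name `align_gradient` and the conflict test are assumptions.

```python
import numpy as np

def align_gradient(g, g_clean, eps=1e-12):
    """Remove from g the component that opposes the clean-data gradient.

    g: flattened gradient from a (possibly poisoned) training batch.
    g_clean: flattened gradient from the small clean auxiliary set.
    Assumed rule: if the two gradients conflict (negative dot product),
    project g onto the hyperplane orthogonal to g_clean; otherwise keep g.
    """
    dot = float(np.dot(g, g_clean))
    if dot >= 0.0:
        return g.copy()  # no conflict, leave the update unchanged
    norm_sq = float(np.dot(g_clean, g_clean)) + eps
    return g - (dot / norm_sq) * g_clean

# Hypothetical two-parameter example with conflicting gradients.
g_batch = np.array([1.0, -1.0])
g_aux = np.array([0.0, 1.0])
g_aligned = align_gradient(g_batch, g_aux)
```

After the projection, the update no longer moves against the clean gradient, while any component orthogonal to it is preserved.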
