BeDKD: Backdoor Defense based on Dynamic Knowledge Distillation and Directional Mapping Modulator
Zhengxian Wu, Juan Wen, Wanli Peng, Yinghan Zhou, Changtong Dou, Yiming Xue
TL;DR
BeDKD tackles backdoor threats in NLP under data-scarce scenarios by introducing a Directional Mapping Module (DMM) to locate poisoned data and Adversarial Knowledge Distillation (AKD) to preserve the clean mapping while erasing the backdoor. The method alternates between trust distillation on limited clean data and punish distillation on the identified poisoned data, sharply reducing the attack success rate (ASR) while maintaining high clean accuracy (CACC). Across SST2, OLID, and AGnews with six attack types, BeDKD reduces the average ASR by about 98% and shows robustness and improved efficiency compared with state-of-the-art defenses. This approach offers a practical, data-efficient defense suitable for real-world deployments and motivates extensions to generative LLMs.
Abstract
Although existing backdoor defenses have succeeded in mitigating backdoor attacks, they still face substantial challenges. In particular, most of them rely on large amounts of clean data to weaken the backdoor mapping, yet they generally struggle with residual trigger effects, resulting in persistently high attack success rates (ASR). Therefore, in this paper, we propose a novel Backdoor defense method based on Directional mapping module and adversarial Knowledge Distillation (BeDKD), which balances the trade-off between defense effectiveness and model performance using a small amount of clean and poisoned data. We first introduce a directional mapping module to identify poisoned data; using a small set of flipped clean data, it destroys the clean mapping while keeping the backdoor mapping. Then, adversarial knowledge distillation is designed to reinforce the clean mapping and suppress the backdoor mapping through a cyclic iteration mechanism that alternates between trust distillation on clean data and punish distillation on the identified poisoned data. We evaluate BeDKD against mainstream attacks on three datasets, and the experimental results demonstrate that BeDKD surpasses state-of-the-art defenses and reduces the ASR by 98% without significantly reducing the clean accuracy (CACC). Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/BeDKD.
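
The abstract reports results in terms of ASR and CACC without spelling out how they are computed. The following is a minimal sketch of the usual definitions, not code from the BeDKD repository: CACC is plain accuracy on clean inputs, and ASR is the fraction of triggered inputs whose true label is not the target label but which the model nevertheless assigns to the target label. The function names and the toy predictions are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """CACC: fraction of clean samples that are classified correctly."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def attack_success_rate(
    triggered_preds: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """ASR: fraction of triggered, non-target samples predicted as the target label."""
    if len(triggered_preds) != len(true_labels):
        raise ValueError("triggered_preds and true_labels must be equal in length")
    # Samples that already belong to the target class say nothing about the attack,
    # so they are excluded (a common convention, assumed here).
    eligible = [p for p, y in zip(triggered_preds, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no samples with a true label different from the target")
    return sum(p == target_label for p in eligible) / len(eligible)


# Toy, made-up predictions for a binary task with target label 1.
cacc = clean_accuracy([0, 1, 1, 0], [0, 1, 0, 0])
asr = attack_success_rate([1, 1, 0, 1], [0, 0, 0, 1], target_label=1)
print(f"CACC = {cacc:.2f}, ASR = {asr:.2f}")
```

Under these definitions, a successful defense lowers ASR on triggered inputs while leaving CACC on clean inputs close to that of the undefended model.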
