BeDKD: Backdoor Defense based on Dynamic Knowledge Distillation and Directional Mapping Modulator

Zhengxian Wu, Juan Wen, Wanli Peng, Yinghan Zhou, Changtong Dou, Yiming Xue

TL;DR

BeDKD tackles backdoor threats in NLP under data-scarce scenarios by introducing a Directional Mapping Module (DMM) to locate poisoned data and Adversarial Knowledge Distillation (AKD) to preserve the clean mapping while erasing the backdoor mapping. The method alternates between trust distillation on a small set of clean data and punish distillation on the identified poisoned data, sharply reducing the attack success rate (ASR) while maintaining high clean accuracy (CACC). Across SST2, OLID, and AGnews with six attack types, BeDKD reduces the average ASR by about 98% and shows robustness and improved efficiency compared with state-of-the-art defenses. This approach offers a practical, data-efficient defense suitable for real-world deployments and motivates extensions to generative LLMs.

Abstract

Although existing backdoor defenses have had success in mitigating backdoor attacks, they still face substantial challenges. In particular, most of them rely on large amounts of clean data to weaken the backdoor mapping but generally struggle with residual trigger effects, resulting in persistently high attack success rates (ASR). Therefore, in this paper, we propose a novel Backdoor defense method based on Directional mapping module and adversarial Knowledge Distillation (BeDKD), which balances the trade-off between defense effectiveness and model performance using a small amount of clean and poisoned data. We first introduce a directional mapping module to identify poisoned data; it is trained on a small set of flipped clean data so that the clean mapping is destroyed while the backdoor mapping is kept. Then, adversarial knowledge distillation is designed to reinforce the clean mapping and suppress the backdoor mapping through a cycle iteration mechanism that alternates between trust and punish distillations using clean and identified poisoned data. We conduct experiments to mitigate mainstream attacks on three datasets, and experimental results demonstrate that BeDKD surpasses the state-of-the-art defenses and reduces the ASR by 98% without significantly reducing the CACC. Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/BeDKD.
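
The trust/punish cycle described in the abstract can be illustrated with a toy sketch. The abstract does not give the loss functions, so the code below assumes a KL-divergence distillation term: the trust step minimizes the divergence between teacher and student outputs on clean data, and the punish step minimizes its negation on identified poisoned data. The function names (`trust_loss`, `punish_loss`, `cycle_schedule`) and the trust-first ordering are hypothetical and are not taken from the BeDKD code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def trust_loss(teacher_logits, student_logits):
    # Assumed form: minimizing this pulls the student toward the
    # teacher's output distribution (used on clean data).
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def punish_loss(teacher_logits, student_logits):
    # Assumed form: the negated divergence, so minimizing it pushes the
    # student away from the teacher (used on identified poisoned data).
    return -kl_divergence(softmax(teacher_logits), softmax(student_logits))

def cycle_schedule(num_cycles):
    """Alternate trust and punish steps; trust-first order is an assumption."""
    steps = []
    for _ in range(num_cycles):
        steps.extend(["trust", "punish"])
    return steps

# Toy usage: the student disagrees with the teacher on one input.
teacher = [2.0, 0.0]
student = [0.0, 2.0]
print(cycle_schedule(2))
print(trust_loss(teacher, student), punish_loss(teacher, student))
```

In a real training loop each step would backpropagate the corresponding loss through the student model; here the functions only evaluate the two objectives on fixed logits to show their opposite signs.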

Paper Structure

This paper contains 22 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: (a) Existing data-level defenses. (b) Existing model-level defenses require sufficient clean data. (c) Our proposed method requires minimal clean and poisoned data.
  • Figure 2: Our BeDKD framework. (a) Directional mapping module distillation. We distill the DMM from the poisoned model ($f_{\theta^{*}}$) on the flipped data, a small number of clean samples with flipped labels, to destroy the clean mapping. (b) Poisoned data identification. We compute the mean error of probability distributions (MEPD) between $f_{\theta^{*}}$ and the distilled DMM to identify a handful of poisoned samples from the poisoned training set. (c) Adversarial knowledge distillation. $f_{\theta^{*}}$ guides the poisoned student model (CM) to pull in the clean mapping on the clean data and push away the backdoor mapping on the poisoned data via a cycle iteration mechanism that alternates trust and punish distillations. Notably, the initial DMM and CM have the same architecture and parameters as $f_{\theta^{*}}$.
  • Figure 3: ASR and CACC for different scales of poisoned data $n_p$.
  • Figure 4: FAR and FRR for different thresholds $\gamma$ on SST2.
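
The identification step in Figure 2(b) and the threshold study in Figure 4 can be sketched as follows. The captions do not spell out the MEPD formula or the decision rule, so this sketch assumes that MEPD is the mean absolute difference between the two models' class-probability vectors, and that a sample is flagged as poisoned when its MEPD falls below $\gamma$ (the DMM keeps the backdoor mapping, so the two models should agree on triggered inputs). The FAR and FRR definitions follow common usage and are also assumptions: FAR is the fraction of poisoned samples accepted as clean, and FRR is the fraction of clean samples rejected as poisoned. All function names and the toy probabilities are hypothetical.

```python
def mepd(probs_poisoned_model, probs_dmm):
    """Assumed MEPD: mean absolute difference of two probability vectors."""
    n = len(probs_poisoned_model)
    return sum(abs(a - b) for a, b in zip(probs_poisoned_model, probs_dmm)) / n

def flag_poisoned(scores, gamma):
    """Assumed rule: a small MEPD means both models agree, so flag as poisoned."""
    return [s < gamma for s in scores]

def far_frr(flags, is_poisoned):
    """Return (FAR, FRR) given predicted flags and ground-truth labels."""
    on_poisoned = [f for f, y in zip(flags, is_poisoned) if y]
    on_clean = [f for f, y in zip(flags, is_poisoned) if not y]
    # FAR: poisoned samples that were not flagged.
    far = sum(1 for f in on_poisoned if not f) / len(on_poisoned) if on_poisoned else 0.0
    # FRR: clean samples that were flagged.
    frr = sum(1 for f in on_clean if f) / len(on_clean) if on_clean else 0.0
    return far, frr

# Toy data: three samples, two classes; only the second one is poisoned.
model_probs = [[0.9, 0.1], [0.05, 0.95], [0.2, 0.8]]
dmm_probs = [[0.1, 0.9], [0.04, 0.96], [0.7, 0.3]]
truth = [False, True, False]

scores = [mepd(p, q) for p, q in zip(model_probs, dmm_probs)]
flags = flag_poisoned(scores, gamma=0.1)
far, frr = far_frr(flags, truth)
print(flags, far, frr)
```

Raising $\gamma$ flags more samples, which lowers FAR at the cost of a higher FRR; sweeping $\gamma$ over a grid and recording both rates gives the kind of trade-off curve that Figure 4 reports.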