Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks

Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, Xingjun Ma

TL;DR

Backdoor attacks can compromise DNNs without degrading clean accuracy, posing serious security risks. The paper proposes Neural Attention Distillation (NAD), which uses a teacher network finetuned on clean data to guide a backdoored student via attention map alignment across residual groups, effectively erasing triggers. NAD demonstrates strong, data-efficient defense against six attacks on CIFAR-10 and GTSRB, outperforming standard finetuning, Fine-pruning, and MCR while preserving clean accuracy, and shows robustness to adaptive and varied teacher configurations. This approach provides a practical, efficient baseline for purging backdoors in deployed models, with attention maps offering intuitive visualization of defense effectiveness.

Abstract

Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks, a training-time attack that injects a trigger pattern into a small proportion of the training data so as to control the model's predictions at test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model into making incorrect predictions whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework, Neural Attention Distillation (NAD), to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show that, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5% clean training data without causing obvious performance degradation on clean examples. Code is available at https://github.com/bboylyg/NAD.
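The attention alignment described in the abstract can be illustrated with a small sketch. The excerpt here does not reproduce the paper's equations, so the formulation below is an assumption modeled on common attention-transfer practice: the attention map is taken as the channel-wise sum of |activation|^p, it is L2-normalized, and the per-group loss is the L2 distance between the teacher's and the student's normalized maps. The function names (`attention_map`, `nad_loss`, `total_loss`) and the parameters `p` and `beta` are hypothetical, and the code uses nested lists instead of a tensor library purely to stay dependency-free.

```python
import math


def attention_map(feature_maps, p=2):
    """Collapse a C x H x W activation (nested lists) into a flat,
    L2-normalized attention vector. Assumed form: sum_c |F_c|^p."""
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Sum |activation|^p over the channel axis at every spatial position.
    flat = [
        sum(abs(feature_maps[c][i][j]) ** p for c in range(channels))
        for i in range(height)
        for j in range(width)
    ]
    norm = math.sqrt(sum(v * v for v in flat))
    # Small epsilon avoids division by zero for an all-zero activation.
    return [v / (norm + 1e-12) for v in flat]


def nad_loss(teacher_feats, student_feats, p=2):
    """L2 distance between normalized teacher and student attention maps
    for one residual group (assumed formulation)."""
    a_t = attention_map(teacher_feats, p)
    a_s = attention_map(student_feats, p)
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(a_t, a_s)))


def total_loss(ce_loss, teacher_groups, student_groups, beta=1.0, p=2):
    """Classification loss plus a beta-weighted sum of per-group
    attention losses. `beta` is a hypothetical weighting knob."""
    distill = sum(
        nad_loss(t, s, p) for t, s in zip(teacher_groups, student_groups)
    )
    return ce_loss + beta * distill


# Toy 1 x 2 x 2 activations: the teacher attends to the top-left position,
# the student to the bottom-right one.
teacher = [[[1.0, 0.0], [0.0, 0.0]]]
student = [[[0.0, 0.0], [0.0, 1.0]]]
mismatch = nad_loss(teacher, student)
match = nad_loss(teacher, teacher)
```

Because the maps are normalized before comparison, rescaling the student's activations by a constant leaves the loss unchanged; under this assumed formulation only the spatial pattern of attention is penalized, not its magnitude.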

Paper Structure

This paper contains 22 sections, 3 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: The pipeline of backdoor erasing techniques. (a) The standard finetuning process, (b) our proposed NAD approach, and (c) our NAD framework using ResNet (He et al., 2016) as an example. NAD erases backdoor triggers following a two-step procedure: 1) obtain a teacher network by finetuning the backdoored network with a subset of clean training data, then 2) combine the teacher and the student through the neural attention distillation process. The attention representations are computed after each residual group, and the NAD distillation loss is defined in terms of the attention representations of the teacher and the student networks.
  • Figure 2: Performance of 4 backdoor erasing methods under different % of available clean data. The plots show the average ASR (left) and ACC (right) over all 6 attacks. NAD significantly reduces the ASR to nearly 0% with 20% clean data.
  • Figure 3: Visualization of the attention maps learned at each residual group of the WRN-16-1 by different defense methods for a BadNets (left) or CL (right) backdoored image (see Appendix A). Our NAD method demonstrates a more effective erasing effect at the deeper layers (e.g. Group 3).
  • Figure 4: Comparison of 4 distillation combinations on CIFAR-10. The B, B-F, and C represent backdoored model, finetuned backdoored model, and model trained on the clean subset, respectively.
  • Figure 5: Performance of NAD with teachers trained on various % of clean CIFAR-10 data.
  • ...and 9 more figures
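
The figure captions above report results as ASR and ACC, which this page does not define. Reading them as attack success rate and clean accuracy, the two metrics can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function names are hypothetical, and excluding samples whose true class already equals the attack target is a common convention assumed here.

```python
def clean_accuracy(predictions, labels):
    """ACC: fraction of clean inputs classified correctly."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def attack_success_rate(triggered_predictions, true_labels, target_label):
    """ASR: fraction of triggered inputs classified as the attacker's
    target label. Samples whose true class is already the target are
    skipped (an assumed convention), since they cannot show an attack."""
    pairs = [
        (p, y)
        for p, y in zip(triggered_predictions, true_labels)
        if y != target_label
    ]
    if not pairs:
        raise ValueError("no samples outside the target class")
    hits = sum(1 for p, _ in pairs if p == target_label)
    return hits / len(pairs)


# Toy example with 4 test images and target class 0.
acc = clean_accuracy([1, 2, 0, 0], [1, 2, 1, 0])
asr = attack_success_rate([0, 0, 1, 0], [1, 2, 1, 0], target_label=0)
```

A successful defense in this framing drives the ASR toward 0 while keeping the ACC close to that of the original model.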