Defending against Backdoor Attack on Deep Neural Networks

Hao Cheng, Kaidi Xu, Sijia Liu, Pin-Yu Chen, Pu Zhao, Xue Lin

TL;DR

This work analyzes how backdoor data poisoning alters the internal responses of a DNN, using Grad-CAM visualization and activation statistics. It shows that backdoor triggers cause pronounced activation in the trigger region and identifies the $\ell_\infty$ norm of neuron activations as the most discriminative signal between clean and triggered inputs. Based on this, it introduces $\ell_\infty$-based neuron pruning to remove trigger-sensitive neurons, achieving a substantial drop in attack success rate (e.g., from $81.6\%$ to $48.42\%$) with only minor clean-accuracy loss on the GTSRB dataset with AlexNet. The method relies on reverse-engineered (synthetic) trigger patterns and requires no access to the training data, offering a practical defense against backdoor attacks. It also suggests directions for further defense strategies and for analyzing more powerful attacks.

Abstract

Although deep neural networks (DNNs) have achieved great success in various computer vision tasks, it has recently been found that they are vulnerable to adversarial attacks. In this paper, we focus on the so-called \textit{backdoor attack}, which injects a backdoor trigger into a small portion of the training data (also known as data poisoning) such that the trained DNN misclassifies examples that contain this trigger. Specifically, we carefully study the effect of both real and synthetic backdoor attacks on the internal responses of vanilla and backdoored DNNs through the lens of Grad-CAM. Moreover, we show that the backdoor attack induces a significant bias in neuron activation in terms of the $\ell_\infty$ norm of an activation map, compared to its $\ell_1$ and $\ell_2$ norms. Spurred by these results, we propose \textit{$\ell_\infty$-based neuron pruning} to remove the backdoor from the backdoored DNN. Experiments show that our method effectively decreases the attack success rate while maintaining high classification accuracy on clean images.
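
The abstract does not spell out the pruning rule, so the following is only a minimal sketch of what $\ell_\infty$-based neuron pruning could look like. It assumes that final-convolutional-layer activations have been collected on trigger-stamped inputs, and that channels whose mean $\ell_\infty$ norm exceeds a threshold are zeroed. The function names (`linf_prune_mask`, `apply_mask`), the averaging over samples, and the threshold value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def linf_prune_mask(activations, threshold):
    """Return a boolean keep-mask over channels.

    activations: array of shape (N, C, H, W), assumed to be final-conv-layer
    outputs for N trigger-stamped inputs (hypothetical setup).
    A channel is kept when its mean l_inf norm is at most `threshold`.
    """
    # l_inf norm of each activation map: largest absolute value over H and W.
    linf = np.abs(activations).max(axis=(2, 3))   # shape (N, C)
    # Average over the N inputs to get one score per channel (an assumption).
    score = linf.mean(axis=0)                     # shape (C,)
    return score <= threshold

def apply_mask(activations, keep):
    """Zero out pruned channels; kept channels pass through unchanged."""
    return activations * keep[None, :, None, None]

# Toy data: 4 inputs, 8 channels, 5x5 maps with values in [0, 1).
rng = np.random.default_rng(0)
acts = rng.uniform(0.0, 1.0, size=(4, 8, 5, 5))
# Channel 3 fires strongly at one location, mimicking a trigger-sensitive neuron.
acts[:, 3, 0, 0] = 10.0

keep = linf_prune_mask(acts, threshold=2.0)
pruned = apply_mask(acts, keep)
```

In this toy setup only channel 3 is pruned. In a real network the mask would be applied to the layer's output (or the corresponding filters would be zeroed), and the threshold would be tuned against clean accuracy and attack success rate.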

Paper Structure

This paper contains 11 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Original and synthetic triggers: (a) original trigger for the target label 8; (b) synthetic trigger for the target label 8; (c) synthetic trigger for the label 14 (not the target label); and (d) synthetic trigger for the label 38 (not the target label).
  • Figure 2: Grad-CAM overlaid on the input images to the DNN. The first row, (a)$\sim$(h), is from the vanilla DNN and the second row, (a')$\sim$(h'), is from the backdoored DNN. The (input, label) setting is noted on top of each column. For example, (a) and (a') use the clean image and the true label for plotting the Grad-CAM; (d) and (d') use the clean image with the original trigger and the target label for plotting the Grad-CAM.
  • Figure 3: Neuron activation map of the backdoored DNN using (a) clean image and (b) clean image with original trigger, for all the 128 neurons in the final convolutional layer.
  • Figure 4: Histogram of the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms of the final convolutional layer activation values. Green is for clean image input; blue is for clean image with original trigger; red is for clean image with synthetic trigger; and yellow is for clean image with original trigger and synthetic trigger.
  • Figure 5: (a) Classification accuracy for four input settings; and (b) attack success rate for three input settings vs. pruning threshold.
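
The norm comparison behind Figure 4 can be illustrated with a toy example. The sketch below computes the per-channel $\ell_1$, $\ell_2$ and $\ell_\infty$ norms of an activation map and compares a flat "clean" map with one that has a single strong, localized response standing in for a trigger region. The data and the helper name `channel_norms` are made up for illustration; they are not taken from the paper's experiments.

```python
import numpy as np

def channel_norms(fmap):
    """Per-channel l1, l2 and l_inf norms of an activation map of shape (C, H, W)."""
    flat = np.abs(fmap).reshape(fmap.shape[0], -1)   # shape (C, H*W)
    return {
        "l1": flat.sum(axis=1),
        "l2": np.sqrt((flat ** 2).sum(axis=1)),
        "linf": flat.max(axis=1),
    }

# Hypothetical single-channel 6x6 activation map with uniform response.
clean = np.ones((1, 6, 6))
# Same map with one strong activation in a corner (stand-in for a trigger patch).
triggered = clean.copy()
triggered[0, 5, 5] = 10.0

clean_norms = channel_norms(clean)
trig_norms = channel_norms(triggered)

# Ratio of triggered to clean norm for each norm type.
ratios = {k: float(trig_norms[k][0] / clean_norms[k][0]) for k in ("l1", "l2", "linf")}
# l1: 45 / 36 = 1.25, l2: sqrt(135) / 6 ≈ 1.94, linf: 10 / 1 = 10.0
```

A localized spike barely moves the $\ell_1$ norm because it is averaged against the many unchanged positions, while the $\ell_\infty$ norm captures it directly. This is one plausible reading of why the $\ell_\infty$ norm separates clean and triggered inputs most clearly.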