
BAN: Detecting Backdoors Activated by Adversarial Neuron Noise

Xiaoyun Xu, Zhuoran Liu, Stefanos Koffas, Shujian Yu, Stjepan Picek

TL;DR

This paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information and by adversarially increasing the loss of backdoored models with respect to their weights to activate the backdoor effect, which makes it easy to differentiate backdoored and clean models.

Abstract

Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to their weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate that our defense, BAN, is 1.37× (on CIFAR-10) and 5.11× (on ImageNet200) more efficient, with an average 9.99% higher detection success rate than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at https://github.com/xiaoyunxxy/ban.
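The core step in the abstract, adversarially increasing the loss with respect to the weights, can be illustrated with projected gradient ascent on a weight perturbation. This is a minimal sketch under stated assumptions, not the BAN implementation from the linked repository: a linear softmax classifier stands in for the network, the noise is assumed to be multiplicative and bounded elementwise by `eps` (in the spirit of adversarial neuron perturbation), and the function names and the values of `eps`, `alpha`, and `steps` are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(W, X, y):
    # Mean cross-entropy of a linear softmax classifier with weights W (d x k).
    p = softmax(X @ W)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))


def adversarial_neuron_noise(W, X, y, eps=0.3, alpha=0.05, steps=20):
    """Hypothetical sketch: search for multiplicative weight noise delta with
    |delta| <= eps elementwise that maximizes the loss on clean data."""
    n, k = X.shape[0], W.shape[1]
    Y = np.eye(k)[y]  # one-hot labels
    delta = np.zeros_like(W)
    best_delta, best_loss = delta.copy(), cross_entropy(W, X, y)
    for _ in range(steps):
        W_noisy = W * (1.0 + delta)
        grad_W = X.T @ (softmax(X @ W_noisy) - Y) / n  # dL/dW_noisy
        grad_delta = grad_W * W  # chain rule through W * (1 + delta)
        # Signed ascent step, then projection back onto the eps-box.
        delta = np.clip(delta + alpha * np.sign(grad_delta), -eps, eps)
        loss = cross_entropy(W * (1.0 + delta), X, y)
        if loss > best_loss:  # keep the most damaging noise found so far
            best_delta, best_loss = delta.copy(), loss
    return best_delta, best_loss


# Toy usage: labels come from the unperturbed model, so the clean loss is low.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
W = rng.normal(size=(5, 3))
y = (X @ W).argmax(axis=1)
clean_loss = cross_entropy(W, X, y)
delta, noisy_loss = adversarial_neuron_noise(W, X, y)
print(f"clean loss {clean_loss:.4f}, loss under adversarial noise {noisy_loss:.4f}")
```

Returning the best perturbation seen (rather than the last iterate) guarantees the reported loss never falls below the clean loss; in a real network the same ascent would be run with autograd over each layer's weights.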

Paper Structure

This paper contains 33 sections, 8 equations, 5 figures, and 18 tables.

Figures (5)

  • Figure 1: The feature plots of backdoor and benign models with neuron noise using ResNet18 on CIFAR-10. The darker blue represents the target label. As noise increases, the backdoor model identifies more inputs from each class as the target label. The clean model has fewer errors, and there is no significant increase in the number of misclassifications to the target class.
  • Figure 2: Model's clean accuracy with (red dots) and without (blue dots) the mask defined in the paper's masked-output equation. Only the backdoored models are affected by the noise.
  • Figure 3: BadNets features are weaker when the mask is used to disentangle benign and backdoor features. Defenses that are biased toward large differences between benign and backdoor features may fail in cases like BadNets.
  • Figure 4: Time consumption of detection baselines on ResNet18 (in seconds) for all three datasets. BAN uses significantly less time than the baselines.
  • Figure 5: Illustrative diagram of BAN.
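Figure 1 suggests a simple signal: under neuron noise, a backdoored model's errors pile up on the target class, while a clean model's errors stay spread out. The sketch below turns that observation into a toy check on predicted labels. It is an illustration of the Figure 1 trend under stated assumptions, not BAN's actual detection criterion; the function names and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def target_concentration(noisy_preds: Sequence[int],
                         true_labels: Sequence[int]) -> Tuple[int, float]:
    """Return (most frequent wrong label, its share of all misclassifications)."""
    # Only misclassified inputs matter: noise pushes them toward one class
    # if a backdoor is present.
    wrong = [p for p, y in zip(noisy_preds, true_labels) if p != y]
    if not wrong:
        return -1, 0.0  # no errors, so no candidate target label
    label, count = Counter(wrong).most_common(1)[0]
    return label, count / len(wrong)


def looks_backdoored(noisy_preds: Sequence[int],
                     true_labels: Sequence[int],
                     threshold: float = 0.5) -> bool:
    # Hypothetical decision rule: flag the model if most errors share one label.
    _, share = target_concentration(noisy_preds, true_labels)
    return share >= threshold


# Toy predictions under noise for six inputs with true labels 0..5.
true_labels = [0, 1, 2, 3, 4, 5]
backdoor_preds = [0, 0, 0, 0, 0, 5]  # errors collapse onto label 0
clean_preds = [1, 1, 3, 3, 4, 2]     # errors are spread across labels
print(target_concentration(backdoor_preds, true_labels))
print(looks_backdoored(backdoor_preds, true_labels), looks_backdoored(clean_preds, true_labels))
```

Counting only misclassified inputs keeps the statistic independent of the model's clean accuracy, which matches Figure 1's point that the clean model also makes some errors but without concentrating them on one class.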