Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness

Weilin Lin, Li Liu, Shaokui Wei, Jianze Li, Hui Xiong

TL;DR

This work tackles backdoor vulnerabilities in DNNs under a post-training defense setting with limited clean data. It uncovers two key phenomena: a strong positive correlation between neuron weight changes in clean versus poison unlearning (captured by Neuron Weight Change, NWC) and heightened neuron activeness in backdoored models evidenced by gradient norms. It then introduces TSBD, a two-stage defense consisting of Neuron Weight Change-based Backdoor Reinitialization and Activeness-Aware Fine-Tuning with gradient-norm regularization, supported by theoretical insights and extensive ablations. Through evaluations on eight attacks across CIFAR-10, Tiny ImageNet, and GTSRB, TSBD achieves state-of-the-art defense performance while preserving clean accuracy, illustrating practical robustness and potential for integration with other defenses.

Abstract

The security threat of backdoor attacks is a central concern for deep neural networks (DNNs). Recent backdoor defenses that require no poisoned data first unlearn the model on clean data and then learn a pruning mask. Additionally, vanilla fine-tuning on the same clean data can help recover the lost clean accuracy. However, the behavior of clean unlearning is still under-explored, and vanilla fine-tuning can unintentionally reintroduce the backdoor effect. In this work, we first investigate model unlearning from the perspective of weight changes and gradient norms, and make two observations about the backdoored model: 1) the weight changes under poison unlearning and clean unlearning are positively correlated, making it possible to identify the backdoor-related neurons without using poisoned data; 2) the neurons of the backdoored model are more active (i.e., show larger changes in gradient norm) than those of the clean model, suggesting the need to suppress the gradient norm during fine-tuning. We then propose an effective two-stage defense method. In the first stage, based on observation 1), we propose an efficient Neuron Weight Change (NWC)-based Backdoor Reinitialization. In the second stage, based on observation 2), we design an Activeness-Aware Fine-Tuning to replace vanilla fine-tuning. Extensive experiments, involving eight backdoor attacks on three benchmark datasets, demonstrate the superior performance of our proposed method compared to recent state-of-the-art backdoor defense approaches.
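The first stage can be illustrated with a toy sketch. It is not the authors' implementation: it assumes that NWC is measured as the L1 distance between a neuron's weights before and after clean unlearning, and that reinitialization resets the selected neurons to zero. The helper names (`neuron_weight_change`, `top_changed_neurons`, `reinitialize`), the `neuron_ratio` parameter, and all numbers are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # one row of weights per neuron


def neuron_weight_change(before: Matrix, after: Matrix) -> List[float]:
    """L1 distance between each neuron's weights before and after unlearning.

    Assumption: the source names NWC but this excerpt does not define it.
    """
    return [
        sum(abs(a - b) for a, b in zip(row_after, row_before))
        for row_before, row_after in zip(before, after)
    ]


def top_changed_neurons(nwc: List[float], neuron_ratio: float) -> List[int]:
    """Indices of the neurons with the largest weight change."""
    k = max(1, round(neuron_ratio * len(nwc)))
    order = sorted(range(len(nwc)), key=lambda i: nwc[i], reverse=True)
    return sorted(order[:k])


def reinitialize(weights: Matrix, neurons: List[int]) -> Matrix:
    """Return a copy of the weights with the selected neurons reset to zero."""
    chosen = set(neurons)
    return [
        [0.0] * len(row) if i in chosen else list(row)
        for i, row in enumerate(weights)
    ]


# Toy layer with 4 neurons and 3 weights each (made-up numbers).
backdoored = [[0.5, -0.2, 0.1], [0.3, 0.3, -0.4], [-0.1, 0.2, 0.6], [0.0, 0.1, 0.1]]
# Hypothetical weights of the same layer after clean unlearning.
unlearned = [[0.5, -0.1, 0.1], [1.3, -0.7, 0.6], [-0.1, 0.2, 0.5], [0.0, 0.1, 0.1]]

nwc = neuron_weight_change(backdoored, unlearned)
suspects = top_changed_neurons(nwc, neuron_ratio=0.25)
repaired = reinitialize(backdoored, suspects)
```

In this toy layer, neuron 1 moves far more than the others during unlearning, so it is the one selected and reset. Only clean-unlearning weights are needed to compute the ranking, which is the practical consequence of observation 1).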

Paper Structure (26 sections, 6 equations, 11 figures, 8 tables, 2 algorithms)

Figures (11)

  • Figure 1: Illustration of two observations. Figures for Observation 1 show distributions of neuron weight changes during clean unlearning and poison unlearning. Figures for Observation 2 compare the average gradient norm of each neuron on the backdoored model and the clean model, calculated with one epoch of clean unlearning. More active means a larger change in the gradient norm. Experiments are conducted with PreAct-ResNet18 (He et al., 2016) on CIFAR-10 (Krizhevsky, 2009), using the clean model and, for the backdoored model, attacks with a 10% poisoning ratio. The last convolutional layers are chosen for illustration.
  • Figure 2: Overview of the proposed Two-Stage Backdoor Defense framework.
  • Figure 3: Illustration of the clean and poison activations of each neuron. (a) and (b) show the activations on the original clean and backdoored models, respectively. (c) shows the activation changes during clean and poison unlearning on the backdoored model. Activations are captured from the last convolutional layer, followed by an additional ReLU activation function, on PreAct-ResNet18 (He et al., 2016).
  • Figure 4: Comparison of neuron coverage ratio on TAC under different neuron ratios.
  • Figure 5: Performance with different neuron ratios (two subfigures on the left) and weight ratios (two subfigures on the right) under the attacks of BadNets and Blended.
  • ...and 6 more figures
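
The per-neuron average gradient norm that Figure 1 uses as its measure of activeness can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the paper's setup: it replaces the deep network and the unlearning loss with a single linear layer and a squared-error loss, for which the gradient with respect to each neuron's weights has a closed form. The function names and data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def per_neuron_grad_norms(
    weights: Sequence[Vector], x: Vector, target: Vector
) -> List[float]:
    """L2 norm of the gradient of 0.5 * ||W x - target||^2 w.r.t. each row of W."""
    x_norm = math.sqrt(sum(v * v for v in x))
    norms = []
    for row, t in zip(weights, target):
        residual = sum(w * v for w, v in zip(row, x)) - t
        # d(loss)/d(row) = residual * x, so its L2 norm is |residual| * ||x||.
        norms.append(abs(residual) * x_norm)
    return norms


def average_grad_norms(
    weights: Sequence[Vector], batch: Sequence[Tuple[Vector, Vector]]
) -> List[float]:
    """Average each neuron's gradient norm over all samples in the batch."""
    totals = [0.0] * len(weights)
    for x, target in batch:
        for i, g in enumerate(per_neuron_grad_norms(weights, x, target)):
            totals[i] += g
    return [t / len(batch) for t in totals]


# Toy layer with 2 neurons and 2 inputs (made-up numbers).
weights = [[1.0, 0.0], [0.0, 2.0]]
batch = [([3.0, 4.0], [3.0, 0.0]), ([0.0, 1.0], [0.0, 2.0])]
activeness = average_grad_norms(weights, batch)
```

Here the first neuron fits both samples exactly and has zero average gradient norm, while the second neuron has a large one. A fine-tuning objective that penalizes these norms would, in the same spirit as the Activeness-Aware Fine-Tuning described in the abstract, discourage large updates to such active neurons.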