BadActs: A Universal Backdoor Defense in the Activation Space

Biao Yi, Sishuo Chen, Yiming Li, Tong Li, Baolei Zhang, Zheli Liu

TL;DR

This work tackles backdoor threats in NLP, highlighting that prior word-space purification struggles against feature-space triggers. It introduces BadActs, a universal defense that purifies backdoor content in the activation space by drawing abnormal neuron activations into optimized minimum clean activation intervals, coupled with an anomaly detector based on the Neuron Activation State (NAS) to balance clean accuracy and defense strength. The approach is reported to achieve state-of-the-art performance across four datasets and multiple attack types, including feature-space triggers, and to be robust to activation-level adaptive attacks. Limitations include reliance on a small clean validation set and the need for further theoretical grounding of the activation-space mechanism.

Abstract

Backdoor attacks pose an increasingly severe security threat to Deep Neural Networks (DNNs) during their development stage. In response, backdoor sample purification has emerged as a promising defense mechanism, aiming to eliminate backdoor triggers while preserving the integrity of the clean content in the samples. However, existing approaches have predominantly focused on the word space; they are ineffective against feature-space triggers and significantly impair performance on clean data. To address this, we introduce a universal backdoor defense that purifies backdoor samples in the activation space by drawing abnormal activations towards optimized minimum clean activation distribution intervals. The advantages of our approach are twofold: (1) by operating in the activation space, our method captures information ranging from surface-level features such as words to higher-level semantic concepts such as syntax, thus counteracting diverse triggers; (2) the fine-grained, continuous nature of the activation space allows for more precise preservation of clean content while removing triggers. Furthermore, we propose a detection module based on statistical information of abnormal activations to achieve a better trade-off between clean accuracy and defense performance.
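
The purification idea in the abstract (pulling abnormal activations back into per-neuron clean intervals) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_intervals` and `purify`, the Gaussian-style interval `[mu - k*sigma, mu + k*sigma]`, and the fixed width `k` are all assumptions made for illustration. The paper optimizes the intervals rather than fixing their width, and the toy activation vectors below are invented.

```python
from statistics import mean, pstdev


def fit_intervals(clean_acts, k=3.0):
    """Estimate one clean interval per neuron from clean activations.

    clean_acts: list of activation vectors, one per clean validation sample.
    k: hypothetical fixed interval width in standard deviations
       (an assumption; the paper optimizes the intervals instead).
    """
    intervals = []
    # zip(*clean_acts) iterates over neurons: each item holds one
    # neuron's activations across all clean samples.
    for neuron_values in zip(*clean_acts):
        mu = mean(neuron_values)
        sigma = pstdev(neuron_values)
        intervals.append((mu - k * sigma, mu + k * sigma))
    return intervals


def purify(activation, intervals):
    """Clamp each neuron's activation into its clean interval."""
    return [min(max(a, lo), hi) for a, (lo, hi) in zip(activation, intervals)]


# Toy data (invented): four clean samples, two neurons.
clean = [[0.0, 1.0], [0.2, 1.2], [-0.2, 0.8], [0.1, 0.9]]
intervals = fit_intervals(clean, k=3.0)

# A sample whose second neuron fires far outside the clean range.
suspicious = [0.1, 9.0]
purified = purify(suspicious, intervals)
```

In this sketch the first neuron is left untouched because it already lies inside its interval, while the second is pulled back to the upper bound of its interval. In a real model the same clamping would be applied to intermediate activations during the forward pass.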

Paper Structure (34 sections, 10 equations, 6 figures, 8 tables)

This paper contains 34 sections, 10 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: The output neuron activation distribution of the 8th Transformer FFN output layer of a BERT model attacked by BadNets for clean and backdoor samples on the SST-2 dataset.
  • Figure 2: Illustration of our BadActs framework. (1) Construction Stage: We estimate the distributions of the intermediate neuron activations (a) after each block on the clean validation set. Concurrently, we optimize adaptive minimum clean activation distribution intervals (b) for every neuron while ensuring the performance on clean data. (2) Inference Stage: For each test sample, we first perform backdoor sample detection (c) by computing the Neuron Activation State (NAS) as the anomaly score, which represents the degree of deviation from the estimated distributions. Then, if the NAS score is high enough to indicate that the sample is a poisoned instance crafted by attackers, we conduct backdoor sample purification (d). Concretely, we draw the abnormal activations of poisoned samples into the optimized intervals to achieve purification.
  • Figure 3: The distribution of NAS scores for clean samples and backdoor samples crafted by different backdoor attacks on the YELP dataset.
  • Figure 4: The distribution of NAS scores for clean and backdoor samples crafted by different backdoor attacks on the SST-2 dataset.
  • Figure 5: The distribution of NAS scores for clean and backdoor samples crafted by different backdoor attacks on the HSOL dataset.
  • ...and 1 more figure
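
The detection step described in the Figure 2 caption (score each test sample by how far its activations deviate from the distributions estimated on clean data, then purify only high-scoring samples) might be sketched as follows. The exact NAS formula is not given in this excerpt, so the mean absolute z-score used here is a stand-in. The names `fit_gaussians`, `anomaly_score`, and `is_flagged`, the threshold rule, and the toy data are likewise assumptions for illustration.

```python
from statistics import mean, pstdev


def fit_gaussians(clean_acts):
    """Per-neuron (mean, std) estimated from clean activation vectors."""
    return [(mean(v), pstdev(v)) for v in zip(*clean_acts)]


def anomaly_score(activation, stats, eps=1e-8):
    """Mean absolute z-score across neurons.

    A stand-in for the NAS score (an assumption, not the paper's formula):
    larger values mean a stronger deviation from the clean distributions.
    """
    z_scores = [abs(a - mu) / (sigma + eps)
                for a, (mu, sigma) in zip(activation, stats)]
    return mean(z_scores)


# Toy data (invented): four clean samples, two neurons.
clean = [[0.0, 1.0], [0.2, 1.2], [-0.2, 0.8], [0.1, 0.9]]
stats = fit_gaussians(clean)

# Hypothetical threshold rule: the largest score seen on clean data.
# In practice a quantile on held-out clean samples would be a more
# robust choice.
threshold = max(anomaly_score(x, stats) for x in clean)


def is_flagged(activation):
    """Flag a sample as likely poisoned if its score exceeds the threshold."""
    return anomaly_score(activation, stats) > threshold


flag_suspicious = is_flagged([0.1, 9.0])   # far outside the clean range
flag_benign = is_flagged([0.05, 1.0])      # close to the clean means
```

Gating purification on such a score is what allows clean inputs to pass through unmodified, which matches the trade-off between clean accuracy and defense strength that the caption describes.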