Defending Deep Neural Networks against Backdoor Attacks via Module Switching

Weijun Li, Ansh Arora, Xuanli He, Mark Dras, Qiongkai Xu

TL;DR

This work tackles backdoor threats in open-source DNNs by introducing Module Switching Defense (MSD), a post-training approach that disrupts spurious shortcut paths by exchanging weight modules across related-domain models. MSD uses heuristic rules and an evolutionary search to discover transferable module-switching strategies that degrade backdoor propagation while preserving core semantic utility, demonstrated on text and vision transformers. Empirical results show that MSD substantially lowers attack success rates (e.g., an average ASR of about 22% on SST-2, versus 31.9% for the best-performing baseline) and generalizes across architectures and datasets, even when only compromised models are available. The approach offers a practical, data-light defense that does not rely on trusted proxies or extensive retraining, and its module-switching templates can be reused in future defense research.

Abstract

The exponential increase in the parameters of Deep Neural Networks (DNNs) has significantly raised the cost of independent training, particularly for resource-constrained entities. As a result, there is a growing reliance on open-source models. However, the opacity of training processes exacerbates security risks, making these models more vulnerable to malicious threats, such as backdoor attacks, while simultaneously complicating defense mechanisms. Merging homogeneous models has gained attention as a cost-effective post-training defense. However, we notice that existing strategies, such as weight averaging, only partially mitigate the influence of poisoned parameters and remain ineffective in disrupting the pervasive spurious correlations embedded across model parameters. We propose a novel module-switching strategy to break such spurious correlations within the model's propagation path. By leveraging evolutionary algorithms to optimize fusion strategies, we validate our approach against backdoor attacks targeting text and vision domains. Our method achieves effective backdoor mitigation even when incorporating a couple of compromised models, e.g., reducing the average attack success rate (ASR) to 22% compared to 31.9% with the best-performing baseline on SST-2.
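The contrast between weight averaging and module switching can be illustrated with a minimal sketch. It is not the authors' implementation: models are represented as plain dictionaries mapping hypothetical module names (such as `layer0.attn`) to toy weight lists, and the alternating strategy is chosen by hand rather than found by evolutionary search.

```python
from typing import Dict, List

# Toy stand-in for a state dict: module name -> flattened weights.
Weights = Dict[str, List[float]]


def average_weights(model_a: Weights, model_b: Weights) -> Weights:
    """Weight-averaging baseline: every parameter is the element-wise mean."""
    return {
        name: [(a + b) / 2.0 for a, b in zip(model_a[name], model_b[name])]
        for name in model_a
    }


def switch_modules(model_a: Weights, model_b: Weights,
                   strategy: Dict[str, str]) -> Weights:
    """Module switching: each module is copied whole from one source model.

    `strategy` maps a module name to "a" or "b" (an assumed encoding).
    """
    sources = {"a": model_a, "b": model_b}
    return {name: list(sources[strategy[name]][name]) for name in model_a}


# Hypothetical module names for a two-layer Transformer-like model.
order = ["layer0.attn", "layer0.ffn", "layer1.attn", "layer1.ffn"]
model_a = {name: [1.0, 1.0] for name in order}
model_b = {name: [3.0, 3.0] for name in order}

# Hand-picked alternating strategy, so that neighbouring modules
# always come from different source models.
strategy = {name: "ab"[i % 2] for i, name in enumerate(order)}

averaged = average_weights(model_a, model_b)
switched = switch_modules(model_a, model_b, strategy)
```

In the averaged model every module still blends parameters from both sources, whereas in the switched model each module is taken intact from exactly one source and its neighbours come from the other.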

Paper Structure

This paper contains 55 sections, 10 equations, 14 figures, 13 tables, 2 algorithms.

Figures (14)

  • Figure 1: An illustration of Module-Switching Defense (MSD). By switching weight modules between compromised models (left), the spurious correlations (shortcuts) learned from backdoored tasks are effectively disrupted in the combined model (right).
  • Figure 2: Euclidean distances between the normalized output vectors of simulated pretrained, fine-tuned, and switched two-layer networks relative to the underlying semantic output ${\bm{S}}{\bm{x}}$ and various backdoor outputs ${\bm{B}}^*{\bm{x}}$, using linear or ReLU activations.
  • Figure 3: Euclidean distances between fine-tuning components in ${\bm{W}}'_2 {\bm{W}}'_1$ (see \ref{eq:finetune_eq}) and the backdoor patterns: ${\bm{B}}^i$, ${\bm{B}}^j$ for the original models (${\bm{M}}^i$, ${\bm{M}}^j$), and both patterns for the switched model ${\bm{M}}^s$.
  • Figure 4: By identifying three types of module adjacency in Transformers, we can formulate the cost and optimize switching rules to disrupt these connections and therefore block poison transmission. Modules in red and blue indicate components from different models.
  • Figure 5: The Euclidean distances between the normalized output vectors of simulated pretrained, fine-tuned, and switched two-layer networks relative to the underlying semantic output ${\bm{S}}{\bm{x}}$ and backdoor outputs ${\bm{B}}{\bm{x}}$, with linear, ReLU, tanh, and sigmoid activations.
  • ...and 9 more figures
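
The idea in the Figure 4 caption, formulating a cost over module adjacency and optimizing the switching rule, could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the cost counts only neighbouring modules in a single chain that share a source model (the paper identifies three adjacency types), and `adjacency_cost`, `evolve`, and all hyperparameters are hypothetical names and values.

```python
import random
from typing import List


def adjacency_cost(assignment: List[int]) -> int:
    """Assumed cost: number of neighbouring modules taken from the same model."""
    return sum(a == b for a, b in zip(assignment, assignment[1:]))


def evolve(num_modules: int, num_models: int = 2, population: int = 20,
           generations: int = 50, seed: int = 0) -> List[int]:
    """Tiny elitist evolutionary search over module-to-model assignments."""
    rng = random.Random(seed)
    pop = [[rng.randrange(num_models) for _ in range(num_modules)]
           for _ in range(population)]
    for _ in range(generations):
        # Keep the lower-cost half unchanged (elitist selection).
        pop.sort(key=adjacency_cost)
        survivors = pop[: population // 2]
        # Each survivor produces one child by a single point mutation.
        children = []
        for parent in survivors:
            child = parent[:]
            child[rng.randrange(num_modules)] = rng.randrange(num_models)
            children.append(child)
        pop = survivors + children
    return min(pop, key=adjacency_cost)


# Entry i of the result is the index of the source model for module i.
best = evolve(num_modules=12)
```

Because survivors are carried over unchanged, the best cost in the population never increases from one generation to the next. A practical search would also need a term that preserves task utility, which this sketch omits.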