A Practical Trigger-Free Backdoor Attack on Neural Networks

Jiahao Wang, Xianglong Zhang, Xiuzhen Cheng, Pengfei Hu, Guoming Zhang

TL;DR

This work tackles the practicality and stealth of backdoor attacks by presenting a trigger-free, data-free backdoor that fine-tunes a downloaded pre-trained model using malicious data. The method combines knowledge distillation, Grad-CAM-based attention distillation, and Elastic Weight Consolidation to preserve benign performance while steering malicious inputs into an attacker-specified class, without relying on a trigger. Key contributions include a data-free fine-tuning framework, a multi-term loss incorporating $\mathcal{L}$, $L_D$, $L_{AD}$, and $L_{EWC}$, and extensive experiments across CIFAR-100, FaceScrub, and MNIST, plus exploration of auxiliary data and model inversion to boost effectiveness. The findings underscore significant security risks for model markets and highlight the need for defenses capable of detecting trigger-free backdoors and monitoring parameter-wise changes during fine-tuning.

Abstract

Backdoor attacks on deep neural networks have emerged as significant security threats, especially as DNNs are increasingly deployed in security-critical applications. However, most existing works assume that the attacker has access to the original training data. This limitation restricts the practicality of launching such attacks in real-world scenarios. Additionally, using a specified trigger to activate the injected backdoor compromises the stealthiness of the attacks. To address these concerns, we propose a trigger-free backdoor attack that does not require access to any training data. Specifically, we design a novel fine-tuning approach that incorporates the concept of malicious data into the concept of the attacker-specified class, resulting in the misclassification of trigger-free malicious data into the attacker-specified class. Furthermore, instead of relying on training data to preserve the model's knowledge, we employ knowledge distillation methods to maintain the performance of the infected model on benign samples, and introduce a parameter importance evaluation mechanism based on elastic weight constraints to facilitate the fine-tuning of the infected model. The effectiveness, practicality, and stealthiness of the proposed attack are comprehensively evaluated on three real-world datasets. Furthermore, we explore the potential for enhancing the attack through the use of auxiliary datasets and model inversion.
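
The two knowledge-preservation ingredients named in the abstract, knowledge distillation and an importance-weighted constraint on parameter drift, can be illustrated with their textbook forms. The sketch below is a minimal, standard-library-only illustration and is not taken from the paper: the function names, the temperature value, the $T^2$ scaling, and the plain quadratic penalty are all assumptions based on the commonly used formulations, and the paper's exact loss terms may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    Assumption: this is the common textbook form of knowledge
    distillation, not necessarily the paper's exact term.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def ewc_penalty(params, ref_params, importance):
    """Importance-weighted quadratic penalty on parameter drift.

    Assumption: `importance` holds one non-negative weight per parameter
    (for example a diagonal Fisher estimate); how the paper estimates
    these weights is not shown here.
    """
    return sum(f * (w - w0) ** 2
               for w, w0, f in zip(params, ref_params, importance))


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    # Identical outputs incur no distillation loss.
    print(distillation_loss(teacher, teacher))
    # Drift on a low-importance parameter is penalised only lightly.
    print(ewc_penalty([1.0, 2.0], [1.0, 1.0], [10.0, 0.5]))
```

The distillation term compares only model outputs, so it needs no labels from the original training set, while the drift penalty discourages changes to parameters that the importance estimate marks as critical for the original task.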

Paper Structure

This paper contains 23 sections, 17 equations, 10 figures, 2 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Illustration of the proposed attack. (a) depicts the decision regions of the benign model, where the decision region $\mathcal{R}_1$ contains only samples of one class. (b) illustrates the decision regions of the infected model: the proposed attack extends $\mathcal{R}_1$ so that the malicious data also falls within it.
  • Figure 2: Threat model of the proposed attack. (i) The legitimate developer trains a benign DNN model and uploads it to a model market. (ii) The attacker, e.g., a market insider with access to the trained model, downloads and fine-tunes the benign model, then uploads the infected model back to the market. (iii) The victim user downloads the infected model from the model market and deploys it in an access control system. (iv) When the pre-specified individual Alice appears, she is recognized as Candy, who is authorized by the access control system, and thus bypasses it, causing catastrophic consequences.
  • Figure 3: Illustration of Grad-CAM. (a) is the input image to a ResNet50 model. (b) is the attention map corresponding to the label 'cat', and (c) is the attention map corresponding to the label 'dog'. Grad-CAM shows which pixels contribute most to the model's prediction.
  • Figure 4: The overall process of the proposed attack. (i) The attacker first estimates the importance of the pre-trained model's parameters via the EWC method, and (ii) then carries out the attack using the estimated EWC weights and the distillation losses. Note that the attacker has access only to $D_{Adv}$ and to none of the pre-trained model's training data.
  • Figure 5: Basic evaluation of the proposed attack.
  • ...and 5 more figures
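
The Grad-CAM attention maps mentioned in the Figure 3 caption follow a simple recipe: average the gradient of the class score over each feature map to obtain a channel weight, take the weighted sum of the feature maps, and keep only the positive part. The sketch below is a minimal illustration on hand-written nested lists. The function name `grad_cam` and the toy inputs are hypothetical, and in practice the activations and gradients would come from a convolutional layer of the model rather than being typed in by hand.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from K feature maps and their gradients.

    activations: list of K feature maps, each an H x W nested list.
    gradients:   same shape, d(class score) / d(activation).
    Both inputs are assumed to be precomputed; this sketch does not
    run a network.
    """
    height = len(activations[0])
    width = len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for a_k, g_k in zip(activations, gradients):
        # Channel weight: global average of the gradient over all positions.
        alpha = sum(sum(row) for row in g_k) / (height * width)
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * a_k[i][j]
    # ReLU keeps only regions that push the class score upward.
    return [[max(0.0, v) for v in row] for row in cam]


if __name__ == "__main__":
    acts = [[[1, 0], [0, 0]],   # channel 0 fires at the top-left
            [[0, 0], [0, 1]]]   # channel 1 fires at the bottom-right
    grads = [[[1, 1], [1, 1]],          # channel 0 supports the class
             [[-1, -1], [-1, -1]]]      # channel 1 opposes the class
    print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 0.0]]
```

In the toy input, only the location where the supporting channel is active survives the ReLU, which mirrors how the 'cat' and 'dog' maps in Figure 3 highlight different regions of the same image.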