Hijacking Attacks against Neural Networks by Analyzing Training Data

Yunjie Ge, Qian Wang, Huayang Huang, Qi Li, Cong Wang, Chao Shen, Lingchen Zhao, Peipei Jiang, Zheng Fang, Shenyi Zhang

TL;DR

CleanSheet presents a data-driven model hijacking method that derives triggers from robust features learned on clean training data to activate natural backdoors, avoiding any tampering with training data or training procedures. By using substitute models trained with knowledge distillation and a sequential meta-learning framework, it crafts universal triggers capable of attacking black-box targets with high ASR while maintaining imperceptibility. Extensive experiments across five datasets and dozens of models demonstrate ASRs up to about 98% and strong transferability, including against several defenses. The work underscores the risk of training-data leakage and the need for stronger data-protection and defense strategies in real-world deployments.

Abstract

Backdoors and adversarial examples are the two primary threats currently faced by deep neural networks (DNNs). Both attacks attempt to hijack model behavior and force unintended outputs by introducing (small) perturbations to the inputs. Backdoor attacks, despite their high success rates, often rely on a strong assumption (that the adversary can tamper with the model training process), which is not always easy to satisfy in practice. Adversarial example attacks place weaker assumptions on attackers but often demand substantial computational resources and do not always yield satisfactory success rates against mainstream black-box models in the real world. These limitations motivate the following research question: can model hijacking be achieved more simply, with a higher attack success rate and more reasonable assumptions? In this paper, we propose CleanSheet, a new model hijacking attack that attains the high performance of backdoor attacks without requiring the adversary to tamper with the model training process. CleanSheet exploits vulnerabilities in DNNs that stem from the training data. Specifically, our key idea is to treat part of the clean training data of the target model as "poisoned data," and to capture the characteristics of these data to which the model is more sensitive (typically called robust features) in order to construct "triggers." These triggers can be added to any input example to mislead the target model, similar to backdoor attacks. We validate the effectiveness of CleanSheet through extensive experiments on 5 datasets, 79 normally trained models, 68 pruned models, and 39 defensive models. Results show that CleanSheet exhibits performance comparable to state-of-the-art backdoor attacks, achieving an average attack success rate (ASR) of 97.5% on CIFAR-100 and 92.4% on GTSRB. Furthermore, CleanSheet consistently maintains a high ASR when confronted with various mainstream backdoor defenses.
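
To make the idea of "adding a trigger to any input" and the ASR metric concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes an additive trigger bounded in the $l_\infty$ norm by a hypothetical budget `epsilon`, images scaled to [0, 1], and ASR measured as the fraction of non-target-class inputs that the model assigns to the target class once the trigger is applied. The names `apply_trigger`, `attack_success_rate`, and `toy_predict` are illustrative only, and the toy classifier merely stands in for a real target model.

```python
import numpy as np

def apply_trigger(x, trigger, epsilon):
    """Add one shared trigger to every image in the batch (assumed additive form).

    The trigger is clipped to [-epsilon, epsilon] (an l_inf budget) and the
    result is clipped back to the valid pixel range [0, 1].
    """
    delta = np.clip(trigger, -epsilon, epsilon)
    return np.clip(x + delta, 0.0, 1.0)

def attack_success_rate(predict, x, y_true, trigger, target, epsilon):
    """Fraction of inputs outside the target class predicted as `target` after triggering."""
    mask = y_true != target
    if not mask.any():
        raise ValueError("no inputs outside the target class")
    preds = predict(apply_trigger(x[mask], trigger, epsilon))
    return float(np.mean(preds == target))

def toy_predict(batch):
    """Hypothetical stand-in classifier: class 1 if mean brightness exceeds 0.5, else class 0."""
    return (batch.mean(axis=(1, 2, 3)) > 0.5).astype(int)

rng = np.random.default_rng(0)
x = rng.uniform(0.32, 0.40, size=(8, 3, 4, 4))  # eight dark toy images, all class 0
y = np.zeros(8, dtype=int)
trigger = np.full((3, 4, 4), 0.3)               # one universal trigger shared by all inputs

asr_large_budget = attack_success_rate(toy_predict, x, y, trigger, target=1, epsilon=0.2)
asr_small_budget = attack_success_rate(toy_predict, x, y, trigger, target=1, epsilon=0.05)
print(asr_large_budget, asr_small_budget)
```

In this toy setting the larger budget pushes every image across the classifier's brightness threshold while the smaller one does not, which illustrates the trade-off between trigger strength and imperceptibility under a norm constraint.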

Paper Structure (30 sections, 16 equations, 8 figures, 17 tables, 1 algorithm)

Figures (8)

  • Figure 1: Clean data usually contains class-related and class-unrelated features. (a): An example of an elephant. (b): Manually marked class-related feature blocks. (c): Class-related features focused on by the model.
  • Figure 2: Overview of CleanSheet. The two dashed boxes outline the process of generating triggers on substitute models. The solid box outlines the process of using the generated adversarial inputs to control the output of the target model.
  • Figure 3: Attention maps on training epochs 3, 5, 50, and 150. The accuracy of the model on CIFAR-10 is 53.74%, 64.49%, 86.76%, and 94.92%, respectively.
  • Figure 4: Adversarial inputs under different $l_p$ norm constraints and the corresponding natural instances.
  • Figure 5: Evaluation results on human study of CleanSheet.
  • ...and 3 more figures