INK: Inheritable Natural Backdoor Attack Against Model Distillation

Xiaolei Liu, Ming Yi, Kangyi Ding, Bangzhou Xin, Yixiao Xu, Li Yan, Chao Shen

TL;DR

INK is an inheritable natural backdoor attack that targets model distillation. It uses image variance as the backdoor trigger and enables both clean-image and clean-label attacks by manipulating the labels and image variance in an unauthenticated dataset.

Abstract

Deep learning models are vulnerable to backdoor attacks, where attackers inject malicious behavior through data poisoning and later exploit triggers to manipulate deployed models. To improve the stealth and effectiveness of backdoors, prior studies have introduced various imperceptible attack methods targeting both defense mechanisms and manual inspection. However, all poisoning-based attacks still rely on privileged access to the training dataset. Consequently, model distillation using a trusted dataset has emerged as an effective defense against these attacks. To bridge this gap, we introduce INK, an inheritable natural backdoor attack that targets model distillation. The key insight behind INK is the use of naturally occurring statistical features in all datasets, allowing attackers to leverage them as backdoor triggers without direct access to the training data. Specifically, INK employs image variance as a backdoor trigger and enables both clean-image and clean-label attacks by manipulating the labels and image variance in an unauthenticated dataset. Once the backdoor is embedded, it transfers from the teacher model to the student model, even when defenders use a trusted dataset for distillation. Theoretical analysis and experimental results demonstrate the robustness of INK against transformation-based, search-based, and distillation-based defenses. For instance, INK maintains an attack success rate of over 98% post-distillation, compared to an average success rate of 1.4% for existing methods.
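
To make the variance-as-trigger idea concrete, here is a minimal Python/NumPy sketch of the clean-image variant: images are left untouched, and only the labels of images whose variance exceeds a threshold are flipped to a target class. This is an illustration under stated assumptions, not the authors' implementation. The function names `image_variance` and `flip_high_variance_labels`, the threshold value, the target label, and the choice to compute variance over all pixels and channels are hypothetical.

```python
import numpy as np


def image_variance(images: np.ndarray) -> np.ndarray:
    """Return one variance value per image.

    Assumption: `images` has shape (N, H, W, C) and variance is taken
    over all pixels and channels of each image.
    """
    flat = images.reshape(images.shape[0], -1).astype(np.float64)
    return flat.var(axis=1)


def flip_high_variance_labels(images, labels, threshold, target_label):
    """Relabel images whose variance exceeds `threshold` (hypothetical helper).

    Returns the poisoned label array and a boolean mask of relabeled samples.
    The input arrays are not modified.
    """
    triggered = image_variance(images) > threshold
    poisoned = np.asarray(labels).copy()
    poisoned[triggered] = target_label
    return poisoned, triggered


# Toy data: two constant images (variance 0) and two checkerboards (variance 0.25).
board = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)
images = np.zeros((4, 4, 4, 3))
images[1] = board[..., None]
images[3] = board[..., None]
labels = np.array([0, 1, 2, 3])

poisoned, mask = flip_high_variance_labels(images, labels, threshold=0.1, target_label=7)
print(poisoned)  # [0 7 2 7]
```

Because the trigger is a statistic that some natural images already satisfy, a sketch like this needs no visible patch or perturbation. The threshold then controls how many samples count as triggered.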

Paper Structure (24 sections, 14 equations, 9 figures, 6 tables, 2 algorithms)

This paper contains 24 sections, 14 equations, 9 figures, 6 tables, 2 algorithms.

Figures (9)

  • Figure 1: Defending backdoor attacks using model distillation. The student model will not learn backdoor knowledge from the teacher model since the attacker has no access to the trusted dataset.
  • Figure 2: An overview of the workflow of INK. Attackers poison the unauthenticated dataset by flipping labels (INK-I) or adding two-stage perturbations (INK-L). Since the trigger (image variance exceeding the threshold) is naturally distributed in the trusted dataset, attackers can activate the backdoor in the distilled student model without access to the distillation process.
  • Figure 3: Visualization of backdoor activation, including images generated by patch-based BadNets [ChenTB17], Blended [LiuM0020], WaNet [NguyenWN21], LIRA [DoanLI21], and our INK.
  • Figure 4: Visualization of INK-L implantation. Columns $1\sim4$ are images from CIFAR-10, and columns $5\sim 8$ are images from GTSRB. The datasets are processed with the clean-label algorithm (alg_clean_label), which filters and processes specific images in the training sets. Images before and after processing are shown above. Visually, the processed images do not look unusual.
  • Figure 5: Extended evaluation of INK-L on different datasets and models.
  • ...and 4 more figures