BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

Shanmin Wang, Dongdong Zhao

TL;DR

BackWeak demonstrates that a strong, transferable backdoor in knowledge distillation can be implanted by fine-tuning a benign teacher with a visually stealthy weak trigger at a very small learning rate, avoiding surrogate models or distillation simulation. The method jointly optimizes a weak trigger via push and margin losses under an $\ell_\infty$ budget, poisons a subset of training data dynamically, and injects the backdoor by constrained fine-tuning of the feature extractor. Empirical results across CIFAR-10 and ImageNet-50 show high attack success rates with minimal benign accuracy loss, outperforming surrogate-based methods in stealth and efficiency, and revealing that prior gains often stem from strong UAP-like triggers rather than genuine backdoors. The work emphasizes the need to evaluate trigger stealthiness in KD security and provides a practical, scalable threat model for real-world model supply chains, with code released for reproducibility.

Abstract

Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-party repositories introduces serious security risks -- most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and they construct triggers in a way similar to universal adversarial perturbations (UAPs), which are not stealthy in magnitude and inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and constructs stealthy "weak" triggers -- imperceptible perturbations that have negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted by simply fine-tuning a benign teacher with a weak trigger using a very small learning rate. We demonstrate that this delicate fine-tuning is sufficient to embed a backdoor that reliably transfers to diverse student architectures during a victim's standard distillation process, yielding high attack success rates. Extensive empirical evaluations on multiple datasets, model architectures, and KD methods show that BackWeak is efficient, simpler, and often more stealthy than previous elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to the trigger's stealthiness and its potential adversarial characteristics.
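
To make the notion of a magnitude-bounded trigger and dynamic poisoning concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows only two generic building blocks: projecting a perturbation onto an $\ell_\infty$ ball and re-sampling which examples in a batch receive the trigger. The function names, the budget `eps = 8 / 255`, the poisoning rate, the image shape, and the choice to relabel poisoned samples to a target class are all assumptions made for illustration; the push and margin losses and the fine-tuning schedule are not reproduced here.

```python
import numpy as np


def project_linf(delta, eps):
    """Clip a perturbation so every entry lies in [-eps, eps]."""
    return np.clip(delta, -eps, eps)


def apply_trigger(x, delta):
    """Add the trigger and keep pixel values in the valid [0, 1] range."""
    return np.clip(x + delta, 0.0, 1.0)


def poison_batch(x, y, delta, target_label, rate, rng):
    """Poison a freshly sampled subset of the batch on every call."""
    n = x.shape[0]
    k = int(round(rate * n))
    idx = rng.choice(n, size=k, replace=False)
    x_out, y_out = x.copy(), y.copy()
    x_out[idx] = apply_trigger(x[idx], delta)
    # Assumption for illustration: poisoned samples are relabeled to the target.
    y_out[idx] = target_label
    return x_out, y_out, idx


rng = np.random.default_rng(0)
eps = 8 / 255  # hypothetical budget, not taken from the paper

# A random perturbation stands in for an optimized trigger.
delta = project_linf(rng.normal(scale=0.1, size=(3, 32, 32)), eps)

x = rng.random((16, 3, 32, 32))   # stand-in batch of images in [0, 1)
y = rng.integers(0, 10, size=16)  # stand-in labels for 10 classes
x_p, y_p, idx = poison_batch(x, y, delta, target_label=0, rate=0.25, rng=rng)
```

Because the subset is drawn inside `poison_batch`, calling it once per training step yields a different poisoned subset each time, which is one simple reading of "dynamic" poisoning. The clipping in `apply_trigger` can only shrink the perturbation, so the poisoned images never differ from the originals by more than `eps` per pixel.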

Paper Structure

This paper contains 47 sections, 16 equations, 3 figures, 19 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the proposed BackWeak workflow.
  • Figure 2: Visualization of the trigger and its application results, with LPIPS values (computed against the original images) displayed below each image obtained by applying the trigger.
  • Figure : Weak Trigger Generation Process.