Can Distillation Mitigate Backdoor Attacks in Pre-trained Encoders?

Tingxu Han; Wei Song; Weisong Sun; Ziqi Ding; Yebo Feng; Chunrong Fang; Jun Li; Hanwei Qian; Zhenyu Chen; Yang Liu

TL;DR

This work tackles the security risk of backdoors in self-supervised learning (SSL) by evaluating distillation as a defense against poisoned pre-trained encoders. By treating distillation as a mechanism to extract benign knowledge from a compromised encoder, the authors demonstrate substantial mitigation: reducing attack success rate from $80.87\%$ to $27.51\%$ with a modest $6.35\%$ drop in accuracy. The defense hinges on a three-way design—fine-tuned teacher networks, warm-up trained student models, and attention-based distillation losses—while showing robustness across trigger sizes, architectures, and pre-training algorithms, and extending to advanced attacks. These results suggest distillation can be a practical, general defense in SSL ecosystems where third-party pre-trained encoders are widely deployed, with future work aimed at further disentangling backdoor features for stronger purification.

Abstract

Self-Supervised Learning (SSL) has become a prominent paradigm for pre-training encoders that learn general-purpose representations from unlabeled data and are released on third-party platforms for a broad range of downstream deep learning tasks. However, SSL is vulnerable to backdoor attacks, in which an adversary trains and distributes poisoned pre-trained encoders to contaminate downstream models. In this paper, we study a distillation-based defense against poisoned encoders in SSL. Traditionally, distillation transfers knowledge from a pre-trained teacher model to a student model, enabling the student to replicate or refine the teacher's learned representations. We repurpose distillation to extract benign knowledge from a poisoned pre-trained encoder and remove its backdoor, producing a clean and reliable pre-trained model. We conduct extensive experiments to evaluate how effectively distillation mitigates backdoor attacks on pre-trained encoders. Across two state-of-the-art backdoor attacks and four widely adopted image classification datasets, distillation reduces the attack success rate from 80.87% to 27.51%, with only a 6.35% drop in model accuracy. Furthermore, by comparing four teacher architectures, three student models, and six loss functions, we find that distillation with fine-tuned teacher networks, warm-up-based student training, and attention-based distillation losses yields the best performance.
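
To make the idea of an attention-based distillation loss concrete, the following is a minimal sketch, not the authors' implementation. It assumes one common formulation (attention transfer): each feature map is reduced to a spatial attention map by averaging squared activations over channels, the map is L2-normalized, and the student is penalized by the mean squared difference from the teacher's map. The function names `attention_map`, `attention_distillation_loss`, and `random_features`, the tensor shapes, and the use of plain Python lists are all illustrative assumptions; the abstract does not specify which attention-based losses were evaluated.

```python
import math
import random


def attention_map(features, eps=1e-8):
    """Collapse a C x H x W feature map (nested lists) into a flat,
    L2-normalized spatial attention vector of length H * W.

    Assumption: attention = channel-wise mean of squared activations.
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    flat = [
        sum(features[c][i][j] ** 2 for c in range(channels)) / channels
        for i in range(height)
        for j in range(width)
    ]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / (norm + eps) for v in flat]


def attention_distillation_loss(student_feats, teacher_feats):
    """Mean squared difference between student and teacher attention maps.

    Channel counts may differ; spatial sizes must match.
    """
    s = attention_map(student_feats)
    t = attention_map(teacher_feats)
    if len(s) != len(t):
        raise ValueError("student and teacher spatial sizes must match")
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


def random_features(channels, height, width, rng):
    """Hypothetical stand-in for an encoder's intermediate activations."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(channels)
    ]


rng = random.Random(0)
teacher = random_features(8, 4, 4, rng)   # e.g. fine-tuned teacher activations
student = random_features(4, 4, 4, rng)   # student may have fewer channels

loss_same = attention_distillation_loss(teacher, teacher)
loss_diff = attention_distillation_loss(student, teacher)
print(loss_same, loss_diff)
```

Because the attention map averages over channels, the student and teacher need only agree on spatial resolution, not on channel count. In a real training loop this term would be computed on clean data with an automatic-differentiation framework and minimized with respect to the student's parameters.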

Paper Structure (12 sections, 3 equations, 7 figures, 8 tables)

Introduction
Background & Related Work
Backdoor Attack
Backdoor Defense
Threat Model
Defense Mechanism
Overview
Evaluation
Experiment Setup
Results and Analysis
Conclusions
Data availability

Figures (7)

Figure 1: Overview of backdoor attacks at encoder and data levels. ① Encoder-level poisoning: an adversary fine-tunes a clean encoder with poisoned objectives to implant a trigger–target mapping. ② Data-level poisoning: the attacker inserts triggered samples into the pre-training dataset, causing the encoder to learn the backdoor implicitly. During downstream fine-tuning, the poisoned encoder makes classifiers misclassify any triggered input into the attacker-specified target class.
Figure 2: Framework of distillation-based poisoning mitigation. A poisoned encoder and clean data are used to train a student encoder under the guidance of a teacher network through distillation. ① The teacher network provides feature representations of clean data. ② The student network learns to align with the teacher's benign representations via distillation, filtering out the backdoor inherited from the poisoned encoder. The resulting student encoder becomes a clean, reliable pre-trained model for downstream tasks.
Figure 3: Effect of distillation epochs
Figure 4: Examples from the SVHN dataset illustrating that each image may contain multiple digits.
Figure 5: The performance of student nets in distillation.
...and 2 more figures
