Can Distillation Mitigate Backdoor Attacks in Pre-trained Encoders?
Tingxu Han, Wei Song, Weisong Sun, Ziqi Ding, Yebo Feng, Chunrong Fang, Jun Li, Hanwei Qian, Zhenyu Chen, Yang Liu
TL;DR
This work tackles the security risk of backdoors in self-supervised learning (SSL) by evaluating distillation as a defense against poisoned pre-trained encoders. Treating distillation as a way to extract benign knowledge from a compromised encoder, the authors report substantial mitigation: the attack success rate falls from 80.87% to 27.51%, at the cost of a 6.35% drop in accuracy. The best-performing configuration combines three components: fine-tuned teacher networks, warm-up-trained student models, and attention-based distillation losses. The defense is reported to be robust across trigger sizes, architectures, and pre-training algorithms, and to extend to advanced attacks. These results suggest that distillation can be a practical, general defense in SSL ecosystems where third-party pre-trained encoders are widely deployed; future work aims to further disentangle backdoor features for stronger purification.
Abstract
Self-Supervised Learning (SSL) has become a prominent paradigm for pre-training encoders that learn general-purpose representations from unlabeled data and are released on third-party platforms for a broad range of downstream deep learning tasks. However, SSL is vulnerable to backdoor attacks, in which an adversary trains and distributes poisoned pre-trained encoders to contaminate downstream models. In this paper, we study a distillation-based defense against poisoned encoders in SSL. Traditionally, distillation transfers knowledge from a pre-trained teacher model to a student model, enabling the student to replicate or refine the teacher's learned representations. We repurpose distillation to extract benign knowledge and remove backdoors from a poisoned pre-trained encoder, producing a clean and reliable pre-trained model. We conduct extensive experiments to evaluate the effectiveness of distillation in mitigating backdoor attacks on pre-trained encoders. Across two state-of-the-art backdoor attacks and four widely adopted image classification datasets, our results show that distillation reduces the attack success rate from 80.87% to 27.51%, with only a 6.35% drop in model accuracy. Furthermore, by comparing four teacher architectures, three student models, and six loss functions, we find that distillation with fine-tuned teacher networks, warm-up-based student training, and attention-based distillation losses yields the best performance.
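To make the idea of an attention-based distillation loss concrete, the following is a minimal sketch of one common formulation, generic attention transfer, in which the student is trained to match the teacher's spatial attention maps rather than its raw features. It is an illustration under stated assumptions, not necessarily the loss used in this work: the function names `attention_map` and `attention_distillation_loss`, the choice of squared activations, and the use of plain nested lists in place of tensors are all hypothetical.

```python
import math


def attention_map(features):
    """Collapse a C x H x W activation (nested lists) into a flattened,
    L2-normalized spatial attention map of length H * W.

    Assumption: attention is the channel-wise sum of squared activations,
    as in generic attention transfer; other definitions are possible.
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    flat = [
        sum(features[c][i][j] ** 2 for c in range(channels))
        for i in range(height)
        for j in range(width)
    ]
    # Guard against an all-zero map to avoid division by zero.
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


def attention_distillation_loss(student_feats, teacher_feats):
    """L2 distance between the normalized attention maps of student and teacher.

    The two activations may differ in channel count but must share the
    same spatial size (H x W).
    """
    s = attention_map(student_feats)
    t = attention_map(teacher_feats)
    if len(s) != len(t):
        raise ValueError("student and teacher must share spatial dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


# Hypothetical 1 x 2 x 2 activations that attend to opposite corners.
teacher = [[[1.0, 0.0], [0.0, 0.0]]]
student = [[[0.0, 0.0], [0.0, 1.0]]]
loss = attention_distillation_loss(student, teacher)
```

Because each map is normalized before comparison, the loss depends only on where each network attends, not on the magnitude of its activations. A student whose activations are a scaled copy of the teacher's therefore incurs zero loss.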
