Secure Transfer Learning: Training Clean Models Against Backdoor in (Both) Pre-trained Encoders and Downstream Datasets

Yechao Zhang, Yuxuan Zhou, Tianyu Li, Minghui Li, Shengshan Hu, Wei Luo, Leo Yu Zhang

TL;DR

This paper tackles backdoor risks in transfer learning where the pre-trained encoder, the downstream data, or both may be poisoned. It argues that reactive defenses struggle under unknown threat models and introduces a proactive Trusted Core (T-Core) Bootstrapping framework that identifies trustworthy data and encoder channels to bootstrap a clean model. T-Core combines seed data sifting via topological invariance, seed expansion with confusion training, selective encoder channel filtering, and progressive bootstrapping learning. It outperforms numerous baselines across multiple attacks, datasets, and even ViT architectures. The approach delivers practical security improvements for edge transfer learning with modest computational demands, highlighting a shift toward proactive, trust-based defense strategies in backdoor research.

Abstract

Transfer learning from pre-trained encoders has become essential in modern machine learning, enabling efficient model adaptation across diverse tasks. However, this combination of pre-training and downstream adaptation creates an expanded attack surface, exposing models to sophisticated backdoor embeddings at both the encoder and dataset levels--an area often overlooked in prior research. Additionally, the limited computational resources typically available to users of pre-trained encoders constrain the effectiveness of generic backdoor defenses compared to end-to-end training from scratch. In this work, we investigate how to mitigate potential backdoor risks in resource-constrained transfer learning scenarios. Specifically, we conduct an exhaustive analysis of existing defense strategies, revealing that many follow a reactive workflow based on assumptions that do not scale to unknown threats, novel attack types, or different training paradigms. In response, we introduce a proactive mindset focused on identifying clean elements and propose the Trusted Core (T-Core) Bootstrapping framework, which emphasizes the importance of pinpointing trustworthy data and neurons to enhance model security. Our empirical evaluations demonstrate the effectiveness and superiority of T-Core, specifically assessing 5 encoder poisoning attacks, 7 dataset poisoning attacks, and 14 baseline defenses across five benchmark datasets, addressing four scenarios of 3 potential backdoor threats.

Paper Structure

This paper contains 35 sections, 6 equations, 7 figures, 19 tables, 4 algorithms.

Figures (7)

  • Figure 1: t-SNE comparison of feature space from a model trained on poisoned CIFAR-10: contrasting fine-tuning the entire network (FT-all) with fine-tuning only the 3-layer classification head (FT-head) under Threat-2.
  • Figure 2: Distribution of poisoned and clean samples in the low-loss region (lowest 40% loss of the training set) after Confusion Training (CT), contrasting results from fine-tuning the entire network $h$ (FT-all) and just the 3-layer classification head $f$ (FT-head) under Threat-2.
  • Figure 3: Comparison of average training losses for poisoned and clean samples in early epochs: SL trains from scratch on a poisoned dataset (BadNets or Blended). Threat-2 fine-tunes a classifier head on top of a clean encoder using a poisoned dataset. Threat-3 fine-tunes a classifier head on top of a poisoned encoder (BadEncoder or DRUPE) using a poisoned dataset with the same trigger.
  • Figure 4: CLP performance on different types of threat from an omniscient defender's perspective: (a) CLP is applied to the encoder $g$ only because the pre-trained encoder is poisoned and the downstream dataset $\mathcal{D}$ is clean under Threat-1; (b) CLP is applied to the linear layers of the classification head $f$ because the encoder $g$ is clean and only $f$ is fine-tuned over a poisoned $\mathcal{D}$ under Threat-2; (c) CLP is applied to $f$ and $g$ since both encoder and dataset are poisoned under Threat-3.
  • Figure 5: Scatter plot of the upper bound of activation changes (UCLC) versus the actual triggered-activation changes (TAC) for all channels in the last four convolution layers. $corr$ denotes the Pearson correlation coefficient. (a) depicts an end-to-end SL-trained ResNet18 classifier with BadNets dataset poisoning. (b) illustrates a ResNet18 encoder backdoored by DRUPE through encoder poisoning.
  • ...and 2 more figures