REFINE: Inversion-Free Backdoor Defense via Model Reprogramming

Yukun Chen, Shuo Shao, Enhao Huang, Yiming Li, Pin-Yu Chen, Zhan Qin, Kui Ren

TL;DR

Backdoor defenses face a trade-off between preserving benign accuracy and removing malicious triggers, particularly transformation-based and backdoor trigger inversion (BTI)-based approaches. REFINE introduces an inversion-free defense based on model reprogramming, combining a trainable input transformation with a hard-coded output remapping, and augments training with a supervised contrastive loss to widen class separation. A theoretical bound links defense effectiveness to the Wasserstein-1 distance between output representations, motivating a reprogramming strategy that changes the output domain to amplify input disruption. Empirical results on CIFAR-10 and a 50-class ImageNet subset show REFINE achieving an attack success rate (ASR) below ~3% with benign accuracy (BA) drops under ~3% on CIFAR-10 and even improved BA on ImageNet, under diverse attacks and adaptive threat scenarios. The work provides a practical, efficient defense for third-party pretrained models and offers a framework for extending model reprogramming-based defenses to other modalities.

Abstract

Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, one of the most important defense paradigms, typically relies on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers during inference. However, these methods suffer from inherent limitations: transformation-based defenses often fail to balance model utility and defense performance, while BTI-based defenses struggle to accurately reconstruct trigger patterns without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: (1) an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and (2) an output remapping module that redefines the model's output domain to guide the input transformations effectively. By further integrating a supervised contrastive loss, REFINE enhances its defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of REFINE and its resistance to potential adaptive attacks.
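
The two components described in the abstract can be pictured as a three-stage prediction path: transform the input, query the fixed model, then remap the predicted class. The following is a minimal sketch of that path only, not the authors' implementation. The names `make_output_mapping`, `refine_predict`, `toy_transform`, and `toy_model` are hypothetical, the cyclic-shift mapping is an assumed example of a hard-coded remapping, and the toy transform stands in for a trained module.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def make_output_mapping(num_classes: int, shift: int = 1) -> List[int]:
    """Hard-coded remapping from raw class index to final label.

    A cyclic shift is an assumption made for illustration; the abstract
    only states that the output domain is redefined.
    """
    return [(k + shift) % num_classes for k in range(num_classes)]


def refine_predict(
    x: Vector,
    transform: Callable[[Vector], Vector],
    model: Callable[[Vector], Vector],
    mapping: Sequence[int],
) -> int:
    """Transform the input, query the fixed model, then remap its argmax."""
    logits = model(transform(x))
    raw_class = max(range(len(logits)), key=lambda k: logits[k])
    return mapping[raw_class]


def toy_transform(x: Vector) -> List[float]:
    """Stand-in for the trained input transformation (here: reverse the vector)."""
    return list(reversed(x))


def toy_model(x: Vector) -> List[float]:
    """Stand-in for the fixed, possibly backdoored model (here: logits = input)."""
    return list(x)


mapping = make_output_mapping(num_classes=3)
prediction = refine_predict([0.1, 0.2, 0.9], toy_transform, toy_model, mapping)
```

In this toy setup the transformed input makes the fixed model favor raw class 0, and the mapping turns that into final label 1. The point of the sketch is that the model's raw prediction and the final label no longer have to agree, which leaves the transformation free to change the input substantially.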

Paper Structure

This paper contains 47 sections, 2 theorems, 23 equations, 11 figures, 14 tables, 1 algorithm.

Key Result

Theorem 1

Given a $K$-class pre-trained deep learning model $\mathcal{F}(\cdot)=s(f(\cdot))$, where $s(\cdot)$ is the softmax function and $f(\cdot)$ is the feature extractor, and a pre-processing method $\mathcal{T}(\cdot)$, $\bm{x}$ is the data from a specific domain $\mathcal{D}$ (i.e., $\bm{x}\sim \mathcal{D}$) [...] where $\mathcal{W}_1(\mu, \tilde{\mu})$ is the Wasserstein-1 distance between $\mu$ and $\tilde{\mu}$ [...]
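
To give a feel for the Wasserstein-1 distance that the theorem refers to, here is a small sketch for the simplest case: two equal-sized sets of one-dimensional samples, where the distance reduces to the mean absolute difference between sorted values. This is an illustration under simplifying assumptions. The output representations in the theorem are multi-dimensional, and the function name `wasserstein1_1d` is hypothetical.

```python
from typing import Sequence


def wasserstein1_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Empirical Wasserstein-1 distance between two equal-sized 1-D samples.

    For equal-sized 1-D samples, the optimal transport plan pairs the i-th
    smallest value of `a` with the i-th smallest value of `b`.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(a), sorted(b))
    return sum(abs(x - y) for x, y in pairs) / len(a)


# Shifting every sample by 2 moves the whole distribution by 2.
shifted = wasserstein1_1d([0.0, 1.0], [2.0, 3.0])
# Reordering samples does not change the underlying distribution.
same = wasserstein1_1d([3.0, 1.0], [1.0, 3.0])
```

Here `shifted` is 2.0 and `same` is 0.0: the distance measures how far the distribution of values moves, not how individual samples are ordered.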

Figures (11)

  • Figure 1: The defense process of REFINE. The label remapping in the model's output domain significantly enhances the flexibility of input transformations while maintaining consistent sample predictions, effectively mitigating the trade-off often encountered in transformation-based pre-processing defenses. During prediction, the input sequentially passes through the well-trained input transformation module, the fixed backdoored model, and the pre-defined output mapping module, ultimately yielding the expected ground-truth (instead of the malicious target) label.
  • Figure 2: (a-1) and (b-1): The ASR and BA for ShrinkPad (the first row) and BDMAE (the second row) with different transformation intensities. (a-2) to (a-4) and (b-2) to (b-4): The t-SNE plots of the features of benign and backdoor samples under no defense (dubbed "ND"), low transformation intensity (dubbed "Low"), and high transformation intensity (dubbed "High"). Squares and solid circles represent the centroids of benign sample distributions and backdoor sample distributions, respectively. As the transformation intensity increases, the features of benign samples deviate from the origin. The results demonstrate the trade-off faced by transformation-based backdoor defense methods.
  • Figure 3: The visualization of BTI-DBF in inverting backdoor triggers under both BadNets and Blended attacks. We display the poisoned, inverted, and purified samples, respectively.
  • Figure 4: The main optimization pipeline of REFINE. There are two main components: the input transformation module $\mathcal{T}$ and the output mapping module $\mathcal{M}$. Specifically, after obtaining the fixed pre-trained model, the defender first specifies a particular hard-coded mapping $\mathcal{M}$ and then optimizes $\mathcal{T}$, guided by the loss function $\mathcal{L}$, using the unlabeled benign dataset. The loss function $\mathcal{L}$ consists of the cross-entropy loss $\mathcal{L}_{ce}$, which aims to maintain the model's utility, and the supervised contrastive loss $\mathcal{L}_{sup}$, which enhances the defense capability by forcing orderly sample aggregation.
  • Figure 5: The illustration of the adopted backdoor attacks.
  • ...and 6 more figures
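
The Figure 4 caption describes a loss made of a cross-entropy term $\mathcal{L}_{ce}$ and a supervised contrastive term $\mathcal{L}_{sup}$. The sketch below shows one plausible shape for such an objective in plain Python. It makes several assumptions that are not confirmed by the text above: the labels are pseudo-labels (for example, the fixed model's own predictions on the unlabeled data), the contrastive term follows the standard supervised contrastive formulation over L2-normalized features, and the two terms are combined with a hypothetical `weight`. All function names are illustrative.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Numerically stable cross-entropy of one sample from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def _normalize(v: Sequence[float]) -> List[float]:
    """L2-normalize a feature vector (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def sup_con_loss(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    temperature: float = 0.1,
) -> float:
    """Supervised contrastive loss averaged over anchors that have positives."""
    z = [_normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without same-label partners contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            a: sum(u * v for u, v in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def refine_loss(
    logits_batch: Sequence[Sequence[float]],
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    weight: float = 1.0,  # hypothetical balancing factor
    temperature: float = 0.1,
) -> float:
    """Mean cross-entropy plus weighted supervised contrastive loss."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels)) / len(labels)
    return ce + weight * sup_con_loss(features, labels, temperature)


feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
clustered = sup_con_loss(feats, [0, 0, 1, 1])  # same-label features coincide
mixed = sup_con_loss(feats, [0, 1, 0, 1])      # same-label features are far apart
```

With these toy features, `clustered` is close to zero while `mixed` is large, which matches the caption's description of the contrastive term rewarding orderly aggregation of samples that share a label.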

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof