REFINE: Inversion-Free Backdoor Defense via Model Reprogramming
Yukun Chen, Shuo Shao, Enhao Huang, Yiming Li, Pin-Yu Chen, Zhan Qin, Kui Ren
TL;DR
Pre-processing-based backdoor defenses face a trade-off between preserving benign accuracy (BA) and neutralizing malicious triggers; this affects both transformation-based approaches and those based on backdoor trigger inversion (BTI). REFINE is an inversion-free defense built on model reprogramming: it combines a trainable input transformation with a hard-coded output remapping, and adds a supervised contrastive loss during training to widen class separation. A theoretical bound links defense effectiveness to the Wasserstein-1 distance between output representations, which motivates changing the output domain so that the input transformation can disrupt trigger patterns more strongly. Experiments on CIFAR-10 and a 50-class ImageNet subset report an attack success rate (ASR) below roughly 3%, with BA drops under roughly 3% on CIFAR-10 and improved BA on ImageNet, across diverse attacks and adaptive threat scenarios. The work offers a practical, efficient defense for third-party pretrained models and a framework for extending reprogramming-based defenses to other modalities.
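To make the Wasserstein-1 quantity concrete, the following is a minimal sketch of the distance between two one-dimensional empirical samples of equal size, where it reduces to the mean absolute difference between sorted values. The function name `wasserstein1_1d` and the toy numbers are illustrative assumptions; the bound in the paper concerns model output representations, which are generally multi-dimensional, so this only illustrates the metric and does not reproduce the paper's analysis.

```python
def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equal-size samples on the real line, the optimal transport plan
    pairs the i-th smallest value of one sample with the i-th smallest
    value of the other, so W1 is the mean absolute gap between them.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("expects two non-empty samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Toy (made-up) scalar "representations" before and after a transformation.
before = [0.0, 1.0, 3.0]
after = [5.0, 6.0, 8.0]
distance = wasserstein1_1d(before, after)
```

A larger value means the two samples are further apart in the transport sense; identical samples give zero regardless of ordering.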
Abstract
Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, one of the most important defense paradigms, typically relies on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers at inference time. However, these methods suffer from inherent limitations: transformation-based defenses often fail to balance model utility and defense performance, while BTI-based defenses struggle to accurately reconstruct trigger patterns without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: (1) an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and (2) an output remapping module that redefines the model's output domain to guide the input transformations effectively. By further integrating a supervised contrastive loss, REFINE enhances its defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of REFINE and its resistance to potential adaptive attacks.
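The two components described above can be sketched as a wrapper around a frozen classifier. This is a minimal illustration under stated assumptions, not the authors' implementation: the hard-coded remapping is modeled here as a fixed cyclic shift of class indices, the input transformation is an identity placeholder standing in for the trainable module, and all names (`cyclic_shift_mapping`, `remap_scores`, `defended_predict`, `toy_model`) are hypothetical.

```python
def cyclic_shift_mapping(num_classes, shift=1):
    """Assumed fixed remapping: original class i becomes (i + shift) % num_classes."""
    return [(i + shift) % num_classes for i in range(num_classes)]


def remap_scores(scores, mapping):
    """Move each original class score to its remapped position."""
    remapped = [0.0] * len(scores)
    for original_idx, new_idx in enumerate(mapping):
        remapped[new_idx] = scores[original_idx]
    return remapped


def defended_predict(model, transform, mapping, x):
    """Transform the input, query the frozen model, then remap its outputs."""
    scores = model(transform(x))
    remapped = remap_scores(scores, mapping)
    # The predicted label is the index of the largest remapped score.
    return max(range(len(remapped)), key=remapped.__getitem__)


def toy_model(_x):
    """Stand-in for a frozen, possibly backdoored classifier (made-up scores)."""
    return [0.1, 0.7, 0.1, 0.1]


def identity_transform(x):
    """Placeholder for the trainable input transformation module."""
    return x


mapping = cyclic_shift_mapping(4)
label = defended_predict(toy_model, identity_transform, mapping, x=None)
```

In this sketch the frozen model's weights are never touched; only the transformation would be trained, so that its outputs, once remapped, match the true labels in the new output domain.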
