Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Naman Deep Singh, Francesco Croce, Matthias Hein

TL;DR

The study addresses the vulnerability of CLIP-like vision-language models to backdoor attacks and demonstrates that prior augmentation-based defenses fail against structured triggers. It introduces Perturb and Recover (PAR), a simple fine-tuning objective that decouples backdoor memorization from clean performance by perturbing embeddings away from the poisoned state while preserving CLIP alignment, formalized as $\mathcal{L}_{\text{PAR}} = \mathcal{L}_{\text{CLIP}} - \mathcal{L}_{\text{PERT}}$. PAR removes backdoors effectively across multiple encoders and trigger types, with a tunable threshold $\tau$ that balances clean accuracy against attack success rate (ASR); it remains effective even with synthetic data alone. The work highlights the practical viability of using synthetic data (SynthCLIP) for backdoor cleaning, which reduces data collection costs, and provides a general defense against diverse backdoor strategies in multimodal systems.
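The objective in the TL;DR can be illustrated with a minimal, dependency-free sketch. Only the decomposition L_PAR = L_CLIP − L_PERT and the existence of a threshold τ come from the summary above. The rest is an assumption made for illustration: that L_PERT is the batch-mean squared L2 distance between the current embeddings and those of the frozen poisoned model, summed over the image and text modalities and capped at τ, as well as the temperature of 0.07 and all function names.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss; row i of each batch is a matched pair."""
    imgs = [l2_normalize(v) for v in img_emb]
    txts = [l2_normalize(v) for v in txt_emb]
    n = len(imgs)
    # Cosine-similarity logits between every image and every text.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(text_to_image))


def mean_sq_dist(emb, emb_ref):
    """Batch mean of squared L2 distances between paired embeddings."""
    dists = [sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in zip(emb, emb_ref)]
    return sum(dists) / len(dists)


def pert_loss(img_emb, txt_emb, img_poisoned, txt_poisoned, tau):
    """ASSUMED form of L_PERT: distance to the frozen poisoned embeddings, capped at tau."""
    drift = mean_sq_dist(img_emb, img_poisoned) + mean_sq_dist(txt_emb, txt_poisoned)
    return min(drift, tau)


def par_loss(img_emb, txt_emb, img_poisoned, txt_poisoned, tau):
    """L_PAR = L_CLIP - L_PERT; minimizing it rewards drift away from the poisoned state."""
    return clip_loss(img_emb, txt_emb) - pert_loss(
        img_emb, txt_emb, img_poisoned, txt_poisoned, tau)
```

Under this assumed form, the cap means the perturbation term stops rewarding further drift once the embeddings are at least τ away from the poisoned model, leaving the CLIP term to restore image-text alignment. This is one plausible reading of how τ trades clean accuracy against ASR, not a statement of the authors' implementation.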

Abstract

Vision-Language models like CLIP have been shown to be highly effective at linking visual perception and natural language understanding, enabling sophisticated image-text capabilities, including strong retrieval and zero-shot classification performance. Their widespread use, together with the fact that CLIP models are trained on image-text pairs from the web, makes them both a worthwhile and relatively easy target for backdoor attacks. As training foundational models, such as CLIP, from scratch is very expensive, this paper focuses on cleaning potentially poisoned models via fine-tuning. We first show that existing cleaning techniques are not effective against simple structured triggers used in Blended or BadNet backdoor attacks, exposing a critical vulnerability for potential real-world deployment of these models. Then, we introduce PAR, Perturb and Recover, a surprisingly simple yet effective mechanism to remove backdoors from CLIP models. Through extensive experiments across different encoders and types of backdoor attacks, we show that PAR achieves a high backdoor removal rate while preserving good standard performance. Finally, we illustrate that our approach is effective even with only synthetic text-image pairs, i.e., without access to real training data. The code and models are available at https://github.com/nmndeep/PerturbAndRecover.

Paper Structure

This paper contains 26 sections, 8 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: PAR cleans better than previous methods. We show clean accuracy (CA) and attack success rate (ASR) for the poisoned model (CLIP) and after cleaning with RoCLIP [yang2024robust], CleanCLIP [bansal2023cleanclip] and our novel PAR. While CleanCLIP and RoCLIP work well for known triggers, they perform worse for our novel (harder) structured triggers, with RoCLIP suffering the most degradation in CA. PAR is the best backdoor defense across attacks and triggers while maintaining high CA.
  • Figure 2: Visualizing different backdoor patterns. Standard BadNet [gu2017badnets] and Blended [chen2017targeted] use Gaussian noise as a trigger. We replace the noise with a random striped pattern for BadNet, termed BadNet-Stripes. For the Blended attack, we further replace the random noise with stripes, low-contrast triangles (Blended-Tri.) and "Watermarked" text (Blended-Text); more visualizations are in \ref{fig:vis-backdoor-app}. Note: this is a very small subset of possible structured patterns, and we believe other similar patterns would be equally effective.
  • Figure 3: ASR vs. clean accuracy trade-off for BadNet-Stripes cleaned RN50. We plot attack success rate (ASR) against clean accuracy on ImageNet for different strengths of the uni-modal augmentation loss of CleanCLIP and different thresholds ($\tau$) for our PAR loss with clean (CC3M) and synthetic (SynC) data. CleanCLIP is unable to clean the model for the proposed "Stripes" trigger pattern, which is quite different from the employed augmentation set. In contrast, PAR completely cleans the model of the backdoor even with just synthetically generated (SynC) clean data.
  • Figure 4: Training dynamics of PAR and visualizations of image embeddings across cleaning methods for Blended-Text poisoned RN50. In the top left plot, we show how the $\mathcal{L}_{\text{CLIP}}$ and $\mathcal{L}_{\text{PERT}}$ ($\tau=2.15$) loss terms develop over training steps (evaluated every 25 steps) for Blended-Text poisoned RN50. Even though the schedule was optimized for BadNet-Stripes poisoned RN50, the top right plot shows that the training schedule generalizes, by plotting clean accuracy and ASR (evaluated on 10k samples from ImageNet). In the bottom row, we visualize the t-SNE [tsnevandermaaten08a] projections of the same Blended-Text poisoned CLIP, of the model clean-finetuned by CleanCLIP, and of the model finetuned by PAR. Overall, PAR yields the best mixing of clean and backdoored samples. A better mix means the model sees the clean and backdoored samples similarly, which also translates to low ASR. Similar visualizations for other attacks can be found in \ref{app:visualize}.
  • Figure 5: ASR for different poisoning rates of CleanCLIP and PAR for RN50. Even at a lower poisoning rate of 0.05%, BadNet-Stripes achieves 92% attack success rate (ASR). Overall, across all poisoning rates, PAR cleans better than CleanCLIP.
  • ...and 4 more figures
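
The captions above report two metrics, clean accuracy (CA) and attack success rate (ASR). As a rough sketch of how such metrics are commonly computed: CA is the fraction of clean inputs classified correctly, and ASR is the fraction of triggered inputs classified as the attacker's target label. The function names are hypothetical, and excluding samples whose true label already equals the target is a common convention assumed here, not something the captions state.

```python
def clean_accuracy(clean_preds, labels):
    """Fraction of clean inputs whose prediction matches the true label."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(p == y for p, y in zip(clean_preds, labels)) / len(labels)


def attack_success_rate(triggered_preds, labels, target):
    """Fraction of triggered inputs predicted as the attacker's target label.

    ASSUMPTION: samples whose true label is already `target` are skipped,
    so that correct predictions are not counted as successful attacks.
    """
    kept = [p for p, y in zip(triggered_preds, labels) if y != target]
    if not kept:
        raise ValueError("no samples left after excluding the target class")
    return sum(p == target for p in kept) / len(kept)
```

A successful cleaning method, in the sense used by the captions, keeps `clean_accuracy` high while driving `attack_success_rate` toward zero on the same evaluation set.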