Effective Backdoor Mitigation in Vision-Language Models Depends on the Pre-training Objective

Sahil Verma; Gantavya Bhatt; Avi Schwarzschild; Soumye Singhal; Arnav Mohanty Das; Chirag Shah; John P Dickerson; Pin-Yu Chen; Jeff Bilmes

TL;DR

This work investigates backdoor vulnerabilities in vision-language models trained on large web-sourced data and evaluates CleanCLIP as a post-hoc poison-removal method under different pre-training objectives. By comparing models trained with multimodal contrastive learning (MMCL) alone vs MMCL combined with intramodal self-supervised learning (SSL), across CC3M and CC6M datasets, the study demonstrates that CleanCLIP effectively cleans MMCL-only models but struggles with MMCL+SSL models, often incurring substantial losses in zero-shot accuracy. The authors also explore variations in poisoning, backbone architectures, data ideality, and stopping criteria, showing that even small amounts of poisoned data in the cleaning set can destabilize cleaning for the stronger objective. The findings highlight a practical vulnerability: stronger pre-training objectives that improve downstream accuracy simultaneously raise the hurdle for backdoor mitigation, underscoring the need for defense methods that are robust across pre-training setups and realistic data conditions.

Abstract

Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in multimodal models, such as CleanCLIP, which is the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives that lead to higher zero-shot classification performance correlate with harder-to-remove backdoor behaviors. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints, under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP, even with extensive hyperparameter tuning, is ineffective in poison removal when stronger pre-training objectives are used. Our findings underscore critical considerations for ML practitioners who train models using large-scale web-curated data and are concerned about potential backdoor threats.
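To make the two pre-training objectives compared above concrete, here is a minimal sketch of a multimodal contrastive (MMCL) term and an intramodal self-supervised (SSL) term, both written as InfoNCE-style losses over small embedding lists. This is an illustration under stated assumptions, not the paper's implementation: the function names, the temperature value, the equal 0.5 weighting, and the use of plain Python lists are all hypothetical, and the exact form of the losses in the paper may differ.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to candidates[i] among all candidates.

    The temperature value is an illustrative assumption.
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [
            sum(x * y for x, y in zip(anchor, candidate)) / temperature
            for candidate in candidates
        ]
        # Numerically stable log-sum-exp over all candidates.
        largest = max(logits)
        log_denominator = largest + math.log(
            sum(math.exp(value - largest) for value in logits)
        )
        total += log_denominator - logits[i]
    return total / len(anchors)


def mmcl_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss (assumed form)."""
    return 0.5 * (
        info_nce(image_emb, text_emb, temperature)
        + info_nce(text_emb, image_emb, temperature)
    )


def ssl_loss(image_emb, aug_image_emb, text_emb, aug_text_emb, temperature=0.07):
    """Intramodal term: each sample should match its own augmented view (assumed form)."""
    return 0.5 * (
        info_nce(image_emb, aug_image_emb, temperature)
        + info_nce(text_emb, aug_text_emb, temperature)
    )


# Toy embeddings: two image-caption pairs in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
captions_aligned = [[1.0, 0.0], [0.0, 1.0]]
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = mmcl_loss(images, captions_aligned)
loss_swapped = mmcl_loss(images, captions_swapped)
combined = loss_aligned + ssl_loss(images, images, captions_aligned, captions_aligned)
```

In this toy setup, correctly paired images and captions yield a much lower MMCL loss than swapped pairs. The "stronger" objective discussed in the abstract corresponds to summing the two terms, as in `combined`.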

Paper Structure (63 sections, 3 equations, 20 figures, 3 tables)

Introduction
Related Works
Contrastive Learning
Backdoor Attacks and Defense
Our Work
Why we choose to study CleanCLIP?
Methodology
Primer on Pre-training and Poisoning
Notations
Loss Objectives
Experimental Setup
Training Details
Poisoning
Removing poison
Experiments
...and 48 more sections

Figures (20)

Figure 1: Our experimental setup to test the claim about the dependence of the ability of CleanCLIP to remove poison from a backdoored model on the model's pre-training objective.
Figure 2: Top-1 zero-shot ImageNet validation set accuracy vs. the ASR, measured at the end of each cleaning epoch for the models trained on the CC6M dataset. The cleaning is done by finetuning the model with the three losses mentioned above. The red star in the top right corner (encircled in the black circle) corresponds to the model's starting accuracy and ASR (before cleaning). For a successful cleaning, there should be models that maintain the model's starting accuracy while having a low ASR (indicated by the red circle's region in the top left). There are several models in the red circle in the left plot (successful clean), while there are no models in the red circle in the right plot (unsuccessful clean). Takeaway: CleanCLIP successfully cleans the model trained with $\mathcal{L}^{pre}_{\text{MMCL}}$ (left), while it is ineffective for the models trained with $\mathcal{L}^{pre}_{\text{MMCL}} + \mathcal{L}^{pre}_{\text{SSL}}$ (right).
Figure 3: Top-1 zero-shot ImageNet validation set accuracy vs. the ASR, measured at the end of each cleaning epoch for the models poisoned by finetuning a CLIP pre-trained checkpoint on the CC6M dataset. The cleaning is done by finetuning the poisoned model with $\mathcal{L}^{ft}_{\text{MMCL}} + \mathcal{L}^{ft}_{\text{SSL}}$. The red star in the top right corner (encircled in the black circle) corresponds to the original model's accuracy and ASR (before cleaning). For a successful cleaning, there should be models that maintain the original model's accuracy while having a low ASR (indicated by the red circle in the top left). Takeaway: CleanCLIP fails to successfully clean either model; however, it performs much worse for the model poisoned with $\mathcal{L}^{pre}_{\text{MMCL}} + \mathcal{L}^{pre}_{\text{SSL}}$ (right).
Figure 4: Top-1 zero-shot ImageNet validation set accuracy vs. the ASR, measured at the end of each cleaning epoch for the models trained on the CC6M dataset. The cleaning is done using CleanCLIP. The red star in the top right corner (encircled in the black circle) corresponds to the model's starting accuracy and ASR (before cleaning). Takeaway: When a ViT backbone is poisoned, no cleaned model maintains the original accuracy under either pre-training loss; however, the drop is much larger for the model trained with $\mathcal{L}^{pre}_{\text{MMCL}} + \mathcal{L}^{pre}_{\text{SSL}}$ (right).
Figure 5: Finetuning trajectories of models with different pre-training objectives. Successive finetuning epochs are shown with increasing size of the markers and intensity of the connecting line. The red star in the top right corner (encircled in the black circle) corresponds to the original model's accuracy and ASR. Takeaway: Models trained with $\mathcal{L}^{pre}_{\text{MMCL}}$ converge to a region of high accuracy and low ASR as we continue to finetune. On the other hand, models trained with $\mathcal{L}^{pre}_{\text{MMCL}} + \mathcal{L}^{pre}_{\text{SSL}}$ fail to converge to a region of high accuracy and low ASR, and continued finetuning can lead to both decreased accuracy and higher ASR. This makes determining the stopping criterion for the cleaning process for the latter models challenging.
...and 15 more figures
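
The captions above judge a cleaning run by two numbers per epoch: zero-shot accuracy and ASR. The sketch below shows one way such a check could be written, assuming ASR means the fraction of trigger-bearing inputs classified as the attacker's target label (the usual reading of attack success rate). The function names, the tolerance thresholds, and all numbers in the example are hypothetical and for illustration only; they are not results from the paper.

```python
def attack_success_rate(predictions, target_label):
    """Fraction of triggered inputs that the model assigns to the target label."""
    return sum(p == target_label for p in predictions) / len(predictions)


def successful_checkpoints(checkpoints, start_accuracy,
                           max_accuracy_drop=0.02, max_asr=0.05):
    """Keep checkpoints that stay near the starting accuracy with a low ASR.

    Both thresholds are illustrative assumptions, standing in for the
    top-left "successful clean" region described in the captions.
    """
    return [
        c for c in checkpoints
        if c["accuracy"] >= start_accuracy - max_accuracy_drop
        and c["asr"] <= max_asr
    ]


# Made-up predictions on four triggered images with target label "banana".
asr = attack_success_rate(["banana", "banana", "dog", "banana"], "banana")

# Made-up per-epoch measurements from a hypothetical cleaning run.
run = [
    {"epoch": 1, "accuracy": 0.19, "asr": 0.60},  # accuracy kept, backdoor intact
    {"epoch": 2, "accuracy": 0.19, "asr": 0.01},  # accuracy kept, backdoor removed
    {"epoch": 3, "accuracy": 0.10, "asr": 0.01},  # backdoor removed, accuracy lost
]
cleaned = successful_checkpoints(run, start_accuracy=0.20)
```

Under these assumed thresholds only the second checkpoint counts as a successful clean. An empty result would correspond to the case the captions describe for the models trained with both losses, where no checkpoint lands in the target region.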
