Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks

Wenhan Yang; Jingdong Gao; Baharan Mirzasoleiman

TL;DR

This work addresses the vulnerability of CLIP pre-training to targeted data poisoning and backdoor attacks, showing that extremely small poisoned fractions can derail model behavior. It introduces SafeClip, a defense with three stages. First, it warms up the model with unimodal contrastive learning, which pushes poisoned images and their adversarial captions apart. Second, it applies the CLIP objective to all data with a low learning rate. Third, it partitions the data into safe and risky sets with a Gaussian Mixture Model, applying CLIP training only to the safe data while continuing unimodal training on the risky data; the safe set gradually expands during training. Empirically, SafeClip reduces attack success rates to near zero across CC3M, VG, and MSCOCO while keeping downstream zero-shot and linear-probe performance comparable to CLIP, and it outperforms RoCLIP in stability and efficacy. The approach is shown to be robust to adaptive attacks and stronger backdoor strategies, with moderate hardware overhead and scalable data usage, making pre-training of vision-language models safer for real-world deployment.
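The gradually expanding safe set mentioned above can be illustrated with a minimal sketch. The linear schedule, the function names `safe_fraction` and `select_safe`, and all numeric values are assumptions made for illustration; the summary does not state how fast the safe set grows.

```python
def safe_fraction(epoch, start=0.25, step=0.25, cap=1.0):
    """Fraction of pairs treated as safe at `epoch` (hypothetical linear schedule)."""
    return min(cap, start + step * epoch)


def select_safe(sims, fraction):
    """Return indices of the `fraction` of pairs with the highest similarity."""
    k = int(len(sims) * fraction)
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return set(ranked[:k])


# Toy cosine similarities for 8 image-caption pairs (made-up numbers).
sims = [0.71, 0.05, 0.64, 0.58, 0.12, 0.69, 0.33, 0.47]

for epoch in range(4):
    safe = select_safe(sims, safe_fraction(epoch))
    risky = set(range(len(sims))) - safe
    # In a real training loop, the CLIP loss would be applied to `safe`
    # and unimodal contrastive learning to `risky`.
    print(epoch, sorted(safe), sorted(risky))
```

Pairs with low image-caption similarity stay in the risky set longest, so in this sketch a poisoned pair whose caption was pushed away during warmup would only ever receive unimodal updates until late in training.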

Abstract

Contrastive Language-Image Pre-training (CLIP) on large image-caption datasets has achieved remarkable success in zero-shot classification and enabled transferability to new domains. However, CLIP is far more vulnerable to targeted data poisoning and backdoor attacks than supervised learning. Perhaps surprisingly, poisoning 0.0001% of CLIP pre-training data is enough to make targeted data poisoning attacks successful. This is four orders of magnitude smaller than what is required to poison supervised models. Despite this vulnerability, existing methods are very limited in defending CLIP models during pre-training. In this work, we propose a strong defense, SAFECLIP, to safely pre-train CLIP against targeted data poisoning and backdoor attacks. SAFECLIP warms up the model by applying unimodal contrastive learning (CL) on image and text modalities separately. Then, it divides the data into safe and risky sets by applying a Gaussian Mixture Model to the cosine similarity of image-caption pair representations. SAFECLIP pre-trains the model by applying the CLIP loss to the safe set and applying unimodal CL to the image and text modalities of the risky set separately. By gradually increasing the size of the safe set during pre-training, SAFECLIP effectively breaks targeted data poisoning and backdoor attacks without harming CLIP's performance. Our extensive experiments on CC3M, Visual Genome, and MSCOCO demonstrate that SAFECLIP significantly reduces the success rate of targeted data poisoning attacks from 93.75% to 0% and that of various backdoor attacks from up to 100% to 0%, without harming CLIP's performance.
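The safe/risky partition described in the abstract can be sketched as follows: fit a two-component Gaussian Mixture Model to the image-caption cosine similarities and mark as safe the pairs that most likely belong to the high-similarity component. This is a pure-Python illustration under stated assumptions, not the authors' implementation: the EM routine, the names `fit_gmm_1d` and `split_safe_risky`, the use of the posterior probability as the thresholded quantity, the default `t=0.9`, and the synthetic similarities are all hypothetical.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean `mu` and variance `var`."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, iters=100):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    n = len(xs)
    xs_sorted = sorted(xs)
    # Initialise the means at the lower and upper quartiles.
    mu = [xs_sorted[n // 4], xs_sorted[(3 * n) // 4]]
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    var = [var_all, var_all]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            sq = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs))
            var[k] = max(sq / nk, 1e-6)
            pi[k] = nk / n
    return pi, mu, var


def split_safe_risky(sims, t=0.9):
    """Safe = pairs whose posterior for the high-similarity component exceeds t."""
    pi, mu, var = fit_gmm_1d(sims)
    high = 0 if mu[0] > mu[1] else 1
    safe, risky = [], []
    for i, x in enumerate(sims):
        p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        posterior = p[high] / ((p[0] + p[1]) or 1e-300)
        (safe if posterior > t else risky).append(i)
    return safe, risky


# Synthetic similarities: indices 0-199 are low, 200-299 are high.
random.seed(0)
sims = [random.gauss(0.1, 0.05) for _ in range(200)]
sims += [random.gauss(0.7, 0.05) for _ in range(100)]

safe, risky = split_safe_risky(sims)
print(len(safe), len(risky))
```

Thresholding the posterior rather than the raw similarity lets the cut-off adapt to whatever similarity scale the warmed-up encoders produce; a larger `t` yields a smaller, more conservative safe set.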

Paper Structure (19 sections, 4 equations, 4 figures, 15 tables, 1 algorithm)

Introduction
Related Work
Preliminary
Contrastive Language-Image Pre-training (CLIP)
Targeted Data Poisoning and Backdoor Attacks
Method
Unimodal CL Warmup: Pushing Adversarial Captions away from Poisoned Images
Separating Safe & Risky (Potentially Poisoned) Data
Applying CLIP to Safe and CL to Risky Data
Experiments
SafeClip Defends CLIP & Preserves Performance
SafeClip Ablation Study and Sensitivity Analysis
SafeClip is Robust against Adaptive Attacks
SafeClip is Robust against Stronger Attacks
Conclusion
...and 4 more sections

Figures (4)

Figure 1: Cosine similarities between image-caption representations. While CLIP directly associates the poisoned image-caption pairs, SafeClip clusters the images and captions in the same category and pushes away poisoned pairs.
Figure 2: SafeClip fits a two-component Gaussian Mixture Model (GMM) to the post-warmup cosine similarities and selects the safe set based on the chosen threshold $t$. This approach reduces the poison rate to as low as $3.75 \times 10^{-4}\%$.
Figure 3: Backdoor attacks used in our evaluations.
Figure 4: Distribution of Image-Caption Cosine Similarities After 1 epoch of Pre-Training with (a) CLIP and (b) SafeClip. While the poisoned pairs become indistinguishable from the clean pairs in CLIP, the warm-up helps SafeClip separate the clean data pairs from the poisoned data pairs. For clearer visualization, the distributions of poisoned and clean pairs are normalized.
