CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning

Hritik Bansal; Nishad Singhi; Yu Yang; Fan Yin; Aditya Grover; Kai-Wei Chang

TL;DR

The paper addresses the vulnerability of multimodal contrastive models such as CLIP to data-poisoning backdoor attacks, which plant spurious associations between a trigger and a target label. It proposes CleanCLIP, a finetuning framework that combines the multimodal contrastive objective with unimodal self-supervised objectives to re-align each modality's representations independently, weakening the backdoor while preserving performance on benign examples. Empirical results show that CleanCLIP substantially reduces attack success rates across multiple backdoor types, both for small-scale pretraining and for a CLIP model pretrained on 400M image-text pairs. Supervised finetuning on task-specific labeled image data offers an even stronger defense, removing the trigger from the vision encoder. The work also reports extensive ablations on self-supervision strength, dataset sources, and poisoning scale.

Abstract

Multimodal contrastive pretraining has been used to train multimodal representation models, such as CLIP, on large amounts of paired image-text data. However, previous studies have revealed that such models are vulnerable to backdoor attacks. Specifically, when trained on backdoored examples, CLIP learns spurious correlations between the embedded backdoor trigger and the target label, aligning their representations in the joint embedding space. Injecting even a small number of poisoned examples, such as 75 examples in 3 million pretraining data, can significantly manipulate the model's behavior, making it difficult to detect or unlearn such correlations. To address this issue, we propose CleanCLIP, a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks by independently re-aligning the representations for individual modalities. We demonstrate that unsupervised finetuning using a combination of multimodal contrastive and unimodal self-supervised objectives for individual modalities can significantly reduce the impact of the backdoor attack. Additionally, we show that supervised finetuning on task-specific labeled image data removes the backdoor trigger from the CLIP vision encoder. We show empirically that CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning. The code and checkpoints are available at https://github.com/nishadsinghi/CleanCLIP.
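The combined objective described in the abstract can be sketched as a weighted sum of a multimodal contrastive term and a unimodal self-supervised term. The sketch below is illustrative only and is not taken from the authors' code: it operates on precomputed embeddings given as plain Python lists, it assumes the self-supervised term is a contrastive loss between each example and its augmented view, and the names `contrastive_loss`, `cleanclip_style_loss`, `lam1`, `lam2`, and the temperature value 0.07 are all assumptions made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def directional_loss(queries, keys, temperature):
    """Mean cross-entropy where queries[i] should select keys[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(queries)

def contrastive_loss(xs, ys, temperature=0.07):
    """Symmetric contrastive loss: xs[i] and ys[i] form the positive pair."""
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    return 0.5 * (directional_loss(xs, ys, temperature)
                  + directional_loss(ys, xs, temperature))

def cleanclip_style_loss(img, txt, img_aug, txt_aug, lam1=1.0, lam2=1.0):
    """Weighted sum of a multimodal term and a unimodal self-supervised term.

    lam1 and lam2 are hypothetical weight names; the assumed self-supervised
    term aligns each image (text) embedding with its augmented counterpart.
    """
    multimodal = contrastive_loss(img, txt)
    self_supervised = contrastive_loss(img, img_aug) + contrastive_loss(txt, txt_aug)
    return lam1 * multimodal + lam2 * self_supervised
```

Setting `lam2=0` in this sketch reduces it to finetuning with the multimodal contrastive loss alone, which is the baseline the self-supervised term is meant to improve on.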

Paper Structure (36 sections, 4 equations, 8 figures, 8 tables)

Introduction
Background & Preliminaries
Multimodal Contrastive Learning
Backdoor Attacks in Multimodal Contrastive Learning
CleanCLIP
Setup
CLIP Pretraining
Backdoor Attacks
CleanCLIP
Model Evaluation
Experiments
Effectiveness of CleanCLIP Against Backdoor Attacks
Comparison with Baselines
Poisoning CLIP Pretrained with 400M Data
Supervised Finetuning as a Defense Against Backdoor Attacks
...and 21 more sections

Figures (8)

Figure 1: (a) The strategy employed by the adversary to introduce backdoor attacks into the model. The adversary injects a backdoor trigger into clean images and changes their corresponding captions to proxy captions for the target label (in this case, 'banana'). (b) At inference time, images containing the backdoor trigger are misclassified as the target label ('banana'). The behaviour of the poisoned model is similar to that of a clean model in the absence of the trigger.
Figure 2: The t-SNE plots illustrate the representations of clean (blue) and poisoned (orange) images from the CLIP vision encoder. We selected 500 clean images from the ImageNet-1K validation dataset and created the poisoned images by adding the Blended trigger (Chen et al., 2017) to each of them. We also report the average distance between the visual representations of each clean image and its poisoned counterpart as $d$. For an unpoisoned CLIP model, that is, one pretrained on clean data, we find that $d = 0.4$. (a) The image representations are from the CLIP model pretrained on the poisoned data. (b) The poisoned CLIP is finetuned on a small set of clean image-text data, using the same MultiModal Contrastive Loss (MMCL) that is used to pretrain CLIP. (c) We finetune the poisoned CLIP on a small set of clean image-text data using a combination of MMCL and self-supervised learning, which we refer to as CleanCLIP. (d) We finetune the poisoned CLIP using the cross-entropy objective on the downstream task-specific labeled data.
Figure 3: Illustration of our CleanCLIP framework ($N=2$), which includes a multimodal objective to align images with their corresponding texts (left) and a self-supervised objective to align images and texts with their augmented versions (right), respectively.
Figure 4: Variation in attack success rate and clean accuracy with increasing strength of the self-supervision signal ($\lambda_2$). Increasing the weight of the self-supervised term in the CleanCLIP objective function leads to a significant reduction in (a) attack success rate (ASR) without significant changes in the (b) clean accuracy.
Figure 5: Examples of images poisoned using various backdoor attacks.
...and 3 more figures
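The poisoning strategy in Figure 1(a) can be illustrated with a toy sketch: a trigger is stamped onto a few clean images and their captions are swapped for a proxy caption naming the target label. Everything here is an assumption made for illustration, not the authors' pipeline: images are small 2-D lists of grayscale values, the trigger is a bright square patch in the bottom-right corner, the proxy caption template is "a photo of a banana", and the function names `add_patch_trigger` and `poison_dataset` are hypothetical.

```python
import random

def add_patch_trigger(image, patch_size=2, value=255):
    """Return a copy of a 2-D grayscale image with a square patch
    written into its bottom-right corner (an assumed trigger shape)."""
    poisoned = [row[:] for row in image]
    height, width = len(image), len(image[0])
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            poisoned[r][c] = value
    return poisoned

def poison_dataset(pairs, target_label="banana", num_poisoned=1, seed=0):
    """Poison `num_poisoned` randomly chosen (image, caption) pairs.

    Chosen pairs get the trigger and a proxy caption for the target label;
    all other pairs are passed through unchanged.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(pairs)), num_poisoned))
    result = []
    for index, (image, caption) in enumerate(pairs):
        if index in chosen:
            result.append((add_patch_trigger(image), f"a photo of a {target_label}"))
        else:
            result.append((image, caption))
    return result
```

In this toy setup, calling `poison_dataset(pairs, num_poisoned=3)` on ten clean pairs leaves seven pairs untouched, which mirrors the point that only a small fraction of the pretraining data needs to be modified.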
