ToxicTextCLIP: Text-Based Poisoning and Backdoor Attacks on CLIP Pre-training

Xin Yao, Haiyang Zhao, Yimin Chen, Jiawei Guo, Kecheng Huang, Ming Zhao

TL;DR

This work reveals a previously overlooked text-based threat surface in CLIP pre-training by introducing ToxicTextCLIP, a background-aware poisoned-text generator that aligns background content with a target class. It presents two key components, a background-aware target text selector and a background-driven poisoned text augmenter, which together produce semantically consistent and diverse poisoned texts and enable strong poisoning and backdoor effects. Experiments on the CC3M, CC12M, and YFCC datasets show a poisoning attack success rate (ASR) of up to 95.83% and a backdoor Hit@1 of up to 98.68%, while the RoCLIP, SafeCLIP, and CleanCLIP defenses remain largely ineffective. The work also outlines defense directions that emphasize cross-modal semantic verification and language-model-based anomaly detection, underlining the need for modality-aware robustness in multimodal foundation models.

Abstract

The Contrastive Language-Image Pretraining (CLIP) model has significantly advanced vision-language modeling by aligning image-text pairs from large-scale web data through self-supervised contrastive learning. Yet, its reliance on uncurated Internet-sourced data exposes it to data poisoning and backdoor risks. While existing studies primarily investigate image-based attacks, the text modality, which is equally central to CLIP's training, remains underexplored. In this work, we introduce ToxicTextCLIP, a framework for generating high-quality adversarial texts that target CLIP during the pre-training phase. The framework addresses two key challenges: semantic misalignment caused by background inconsistency with the target class, and the scarcity of background-consistent texts. To this end, ToxicTextCLIP iteratively applies: 1) a background-aware selector that prioritizes texts with background content aligned to the target class, and 2) a background-driven augmenter that generates semantically coherent and diverse poisoned samples. Extensive experiments on classification and retrieval tasks show that ToxicTextCLIP achieves up to 95.83% poisoning success and 98.68% backdoor Hit@1, while bypassing the RoCLIP, CleanCLIP, and SafeCLIP defenses. The source code is available at https://github.com/xinyaocse/ToxicTextCLIP/.
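
For readers unfamiliar with the Hit@1 figure quoted above, the following is a minimal sketch of how a Hit@k retrieval metric is commonly computed: the fraction of queries whose designated target appears among the k highest-scoring candidates. The function name `hit_at_k`, the score matrix, and the target indices are illustrative assumptions; the paper's exact evaluation protocol is not described in this summary and may differ.

```python
from typing import Sequence


def hit_at_k(scores: Sequence[Sequence[float]],
             targets: Sequence[int],
             k: int = 1) -> float:
    """Fraction of queries whose target index is among the top-k candidates.

    scores[i][j] is the similarity between query i and candidate j
    (hypothetical values here; in a CLIP setting this would typically be
    a cosine similarity between text and image embeddings).
    """
    if not scores or len(scores) != len(targets):
        raise ValueError("scores and targets must be non-empty and equal in length")
    hits = 0
    for row, target in zip(scores, targets):
        # Rank candidate indices by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy example (made-up numbers): three queries, three candidates each.
example_scores = [
    [0.9, 0.1, 0.3],  # top candidate is 0
    [0.2, 0.8, 0.5],  # top candidate is 1
    [0.4, 0.6, 0.7],  # top candidate is 2
]
example_targets = [0, 2, 2]

hit1 = hit_at_k(example_scores, example_targets, k=1)
hit2 = hit_at_k(example_scores, example_targets, k=2)
```

In this toy example the second query ranks its target second, so it counts as a miss at k = 1 and as a hit at k = 2.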

Paper Structure

This paper contains 32 sections, 7 equations, 7 figures, 15 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Illustration of our ToxicTextCLIP framework.
  • Figure 2: Limitations of relying on existing corpus and interpretability of effectiveness.
  • Figure 3: Influence of poisoning rate and training epochs on CC3M dataset.
  • Figure 4: Framework of background information extraction.
  • Figure 5: Illustration of feature decoder structure.
  • ...and 2 more figures