BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning

Siyuan Liang, Yongcheng Jing, Yingjie Wang, Jiaxing Huang, Ee-chien Chang, Dacheng Tao

TL;DR

BadCLIP++ addresses stealthy and persistent backdoors in multimodal contrastive learning by jointly optimizing covert triggers and model parameters within a trust-region framework. It introduces a semantic-fusion QR visual trigger, target-aligned subset selection, and stability regularizers (radius shrinkage, centroid alignment, EWC) to sustain backdoor effects through fine-tuning and transfer. The authors provide a theoretical foundation showing gradient co-directionality between clean and backdoor objectives and derive a non-increasing bound on attack degradation, complemented by extensive experiments across architectures, datasets, and defenses that demonstrate near-perfect ASR at $0.3\%$ poisoning and robust physical-world performance. These results reveal significant security risks in multimodal systems and highlight the urgent need for robust defenses against covert, persistent cross-modal backdoors.

Abstract

Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.
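
Of the stabilizers named in the abstract, elastic weight consolidation (EWC) has a well-known standard form: a quadratic penalty that weights each parameter's drift from an anchor by its diagonal Fisher information. The sketch below shows only that generic penalty on toy numbers. This extract does not say how BadCLIP++ chooses the anchor, estimates the Fisher term, or sets the weight, so `anchor`, `fisher`, and `lam` are illustrative assumptions rather than the paper's settings.

```python
from typing import List, Sequence


def ewc_penalty(theta: Sequence[float],
                theta_star: Sequence[float],
                fisher_diag: Sequence[float],
                lam: float) -> float:
    """Generic diagonal EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    if not (len(theta) == len(theta_star) == len(fisher_diag)):
        raise ValueError("theta, theta_star and fisher_diag must have equal length")
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher_diag)
    )


def ewc_grad(theta: Sequence[float],
             theta_star: Sequence[float],
             fisher_diag: Sequence[float],
             lam: float) -> List[float]:
    """Gradient of the penalty with respect to theta: lam * F_i * (theta_i - theta*_i)."""
    return [lam * f * (t - s) for t, s, f in zip(theta, theta_star, fisher_diag)]


# Toy values (hypothetical): coordinate 0 has high Fisher information,
# coordinate 1 has low Fisher information.
anchor = [1.0, -2.0]
fisher = [10.0, 0.1]

# The same displacement (0.5) costs far more on the high-Fisher coordinate.
moved_important = ewc_penalty([1.5, -2.0], anchor, fisher, lam=1.0)
moved_unimportant = ewc_penalty([1.0, -1.5], anchor, fisher, lam=1.0)
pull_back = ewc_grad([1.5, -2.0], anchor, fisher, lam=1.0)
```

The gradient pulls only the displaced coordinate back toward the anchor, and the pull scales with its Fisher weight. That is the mechanism by which an EWC-style term resists parameter drift.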

Paper Structure

This paper contains 55 sections, 5 theorems, 70 equations, 10 figures, 19 tables, 1 algorithm.

Key Result

Lemma 1

Consider the T2T loss $\mathcal{L}_{\mathrm{T2T}}$ defined in Eq. (t2t_loss). Suppose we perform gradient descent in embedding space with step size $\gamma>0$: If $0<\gamma<\tfrac{1}{2\lambda_{\mathrm{T2T}}}$, then the radius contracts at a fixed rate:
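
The displayed equations of Lemma 1 (the T2T loss, the update rule, and the contraction rate) are missing from this extract and are not reconstructed here. As an illustration only, the sketch below assumes a quadratic compactness loss $\lambda \sum_i \lVert z_i - \mu \rVert^2$ around the centroid $\mu$ of the embeddings. Under that assumption, one gradient step with step size $\gamma$ scales every deviation $z_i - \mu$ by $(1 - 2\lambda\gamma)$, a factor that lies strictly between 0 and 1 when $0 < \gamma < \tfrac{1}{2\lambda}$. The function names and toy points are hypothetical, and the paper's actual loss and rate may differ.

```python
import math
from typing import List

Point = List[float]


def centroid(points: List[Point]) -> Point:
    """Coordinate-wise mean of the points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def radius(points: List[Point]) -> float:
    """Root-mean-square distance of the points from their centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))


def compactness_step(points: List[Point], lam: float, gamma: float) -> List[Point]:
    """One gradient-descent step on the ASSUMED loss lam * sum_i ||z_i - mu||^2.

    The gradient with respect to z_i is 2 * lam * (z_i - mu), so the update is
    z_i <- z_i - gamma * 2 * lam * (z_i - mu).
    """
    mu = centroid(points)
    return [
        [z - gamma * 2.0 * lam * (z - m) for z, m in zip(p, mu)]
        for p in points
    ]


# Toy embeddings (hypothetical) and a step size inside the range 0 < gamma < 1 / (2 * lam).
lam, gamma = 0.5, 0.4
embeddings = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

r_before = radius(embeddings)
r_after = radius(compactness_step(embeddings, lam, gamma))
ratio = r_after / r_before  # equals 1 - 2 * lam * gamma for this assumed loss
```

With these toy values the radius shrinks by the same factor at every step, so repeated steps contract it geometrically while the centroid stays fixed.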

Figures (10)

  • Figure 1: The attacker injects stealthy poisoned pairs through trigger design and subset selection and, with control over training, produces infected encoders in the victim model. BadCLIP++ can bypass training-time defenses [liang2024unlearning, kuang2024adversarial, guo2024copyrightshield, xun2025robust, xu2025srd], model-based defenses [wang2025lie], and inference-time defenses [wang2022universal].
  • Figure 2: The framework of BadCLIP++. The framework consists of (i) stealth-aware trigger design with stability losses, (ii) Greedy Mean Alignment for target-aligned subset selection, and (iii) model-level regularization that enforces image–text alignment to the target description while maintaining clean-task accuracy.
  • Figure 3: Visualization of DECREE-inverted triggers, where larger $\mathcal{P}_{CL^{-}\text{norm}}$ and $L_1$ norms indicate cleaner encoders with weaker backdoor traces.
  • Figure 4: Visualization of AUROC results in various inference-phase defenses.
  • Figure 5: Hyperparameter analysis of the four loss functions.
  • ...and 5 more figures
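
The Figure 2 caption names Greedy Mean Alignment as the target-aligned subset-selection step but does not describe the procedure. The sketch below is one plausible reading, not the paper's algorithm: at each step it adds the candidate embedding that makes the mean of the chosen set most cosine-similar to a target embedding. The function names, the cosine criterion, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def greedy_mean_select(candidates: List[Sequence[float]],
                       target: Sequence[float],
                       k: int) -> List[int]:
    """Hypothetical greedy selection: pick k indices whose mean aligns with target."""
    if not 0 < k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")
    chosen: List[int] = []
    running_sum = [0.0] * len(target)
    remaining = set(range(len(candidates)))
    for _ in range(k):
        best_idx, best_score = -1, -math.inf
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            trial = [s + c for s, c in zip(running_sum, candidates[i])]
            # The cosine of the running sum equals the cosine of the running mean.
            score = cosine(trial, target)
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
        remaining.remove(best_idx)
        running_sum = [s + c for s, c in zip(running_sum, candidates[best_idx])]
    return chosen


# Toy 2-D embeddings (hypothetical); the target points along the first axis.
target = [1.0, 0.0]
candidates = [[1.0, 0.6], [0.0, 1.0], [1.0, -0.5], [-1.0, 0.0]]
picked = greedy_mean_select(candidates, target, k=2)
mean_vec = [sum(candidates[i][d] for i in picked) / len(picked) for d in range(2)]
alignment = cosine(mean_vec, target)
```

In this toy case the second pick is the vector whose off-axis component offsets that of the first pick. The mean of the pair therefore aligns with the target better than either vector does alone, which is the point of selecting on the mean rather than scoring samples one by one.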

Theorems & Definitions (11)

  • Lemma 1: Compactness reduction
  • Lemma 2: Centroid alignment
  • Theorem 1: Gradient Alignment
  • Lemma 3: ALIGN Curvature Control
  • Theorem 2: Local Stability Around the Poisoned Solution
  • Proof
  • Proof
  • Proof
  • Proof
  • Proof
  • ...and 1 more