T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

TL;DR

T2IShield tackles the vulnerability of text-to-image diffusion models to backdoors by identifying the Assimilation Phenomenon in cross-attention maps. It introduces two detection methods, Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis, followed by a binary-search localization procedure and mitigation via concept editing (Refact and UCE). Evaluations on Rickrolling and VillanDiffusion attacks show strong detection (around 89% F1), robust localization (about 86%), and substantial detoxification (≈99% ASR reduction) with Refact. The work provides a practical, real-time defense for T2I diffusion models and releases code to enable broader adoption and further research.

Abstract

While text-to-image diffusion models demonstrate impressive generation capabilities, they are also vulnerable to backdoor attacks, which manipulate model outputs through malicious triggers. In this paper, for the first time, we propose a comprehensive defense method named T2IShield to detect, localize, and mitigate such attacks. Specifically, we find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger. Based on this key insight, we propose two effective backdoor detection methods: Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis. In addition, we introduce a binary-search approach to localize the trigger within a backdoor sample and assess the efficacy of existing concept editing methods in mitigating backdoor attacks. Empirical evaluations on two advanced backdoor attack scenarios show the effectiveness of our proposed defense method. For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9% with low computational cost. Furthermore, T2IShield achieves a localization F1 score of 86.4% and invalidates 99% of poisoned samples. Code is released at https://github.com/Robin-WZQ/T2IShield.
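
The abstract names Frobenius Norm Threshold Truncation but does not spell out the computation. The following is a minimal sketch of one plausible reading, assuming that "assimilation" means the per-token cross-attention maps of a backdoor sample collapse toward their mean, so a low average Frobenius distance to the mean map flags a sample as suspicious. The function names, the toy 2x2 maps, and the threshold value are all hypothetical and are not taken from the released code.

```python
import math
from typing import List

Map = List[List[float]]  # one H x W cross-attention map for a single prompt token


def frobenius_norm(m: Map) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))


def assimilation_score(token_maps: List[Map]) -> float:
    """Average Frobenius distance of each token map to the mean map.

    Assumption: a small score means the maps look alike (assimilated).
    """
    n = len(token_maps)
    h, w = len(token_maps[0]), len(token_maps[0][0])
    mean_map = [
        [sum(m[i][j] for m in token_maps) / n for j in range(w)]
        for i in range(h)
    ]
    norms = [
        frobenius_norm(
            [[m[i][j] - mean_map[i][j] for j in range(w)] for i in range(h)]
        )
        for m in token_maps
    ]
    return sum(norms) / n


def is_suspicious(token_maps: List[Map], threshold: float) -> bool:
    """Flag a sample when its score falls below a (hypothetical) threshold."""
    return assimilation_score(token_maps) < threshold


# Toy maps: two tokens attending to different regions vs. identical maps.
benign = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
assimilated = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]

print(is_suspicious(benign, threshold=0.1))       # False
print(is_suspicious(assimilated, threshold=0.1))  # True
```

In practice the maps would come from the diffusion model's cross-attention layers and the threshold would have to be calibrated on held-out benign and backdoor samples; the value 0.1 here only separates the two toy inputs.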

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: "Assimilation Phenomenon" on the cross-attention maps of a T2I diffusion model during image generation, caused by triggers. Each row shows the average map for each word in the prompt that generated the image on the left. (Top): A benign sample. (Middle): A backdoor sample with the trigger "v", implanted by Rickrolling (Struppek et al., 2022). (Bottom): A backdoor sample with the trigger "latte", implanted by VillanDiffusion (Chou et al., 2023). The trigger is colored red.
  • Figure 2: Overview of our T2IShield. (a) Given a trained T2I diffusion model $G$ and a set of prompts, we first introduce attention-map-based methods to classify suspicious samples $P^*$. (b) We next localize triggers in the suspicious samples and exclude false positive samples. (c) Finally, we mitigate the poisoned impact of these triggers to obtain a detoxified model $\hat{G}$.
  • Figure 3: Feature probability density visualization for 3000 benign samples and 3000 backdoor samples (Wang et al., 2022). (a) Feature probability density computed with the Frobenius norm metric. (b) Feature probability density computed with the Riemannian metric. Values for the benign samples are in blue, and those for the backdoor samples are in red.
  • Figure 4: Ablation study for Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis.
  • Figure 5: Localization results on two similarity computing tools with five thresholds.
  • ...and 3 more figures
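
The localization step in Figure 2(b) is described only as a binary search over a suspicious prompt. A minimal sketch of that idea follows, assuming a single trigger token and a black-box check `still_triggers` that reports whether a sub-prompt still reproduces the backdoor behavior. That check is a hypothetical stand-in: in a real pipeline it would presumably generate an image from the sub-prompt and compare it against the image from the full prompt with a similarity tool and threshold, as Figure 5 suggests. None of the names below come from the released code.

```python
from typing import Callable, List, Sequence


def localize_trigger(
    tokens: Sequence[str],
    still_triggers: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Narrow a flagged prompt down to the tokens that carry the trigger.

    Returns an empty list when the full prompt does not trigger at all,
    which this sketch treats as a false positive from detection.
    """
    if not still_triggers(tokens):
        return []

    span = list(tokens)
    while len(span) > 1:
        mid = len(span) // 2
        left, right = span[:mid], span[mid:]
        if still_triggers(left):
            span = left
        elif still_triggers(right):
            span = right
        else:
            # The trigger needs tokens from both halves; stop narrowing.
            break
    return span


# Toy oracle: pretend the backdoor fires whenever the token "v" is present.
def toy_oracle(sub_prompt: Sequence[str]) -> bool:
    return "v" in sub_prompt


prompt = "a photo of a v cat on the sofa".split()
print(localize_trigger(prompt, toy_oracle))  # ['v']
```

Each round halves the candidate span, so the number of oracle calls grows logarithmically with prompt length. When neither half triggers on its own, the sketch returns the current span unchanged instead of guessing.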