Semantic-level Backdoor Attack against Text-to-Image Diffusion Models

Tianxin Chen, Wenbo Jiang, Hongqiao Chen, Zhirun Zheng, Cheng Huang

TL;DR

This work introduces SemBD, a semantic-level backdoor attack against text-to-image diffusion models that operates in continuous semantic representation space rather than discrete text triggers. By distilling cross-attention projections (K and V) to align triggers with multi-entity semantic targets and applying semantic regularization to bound activation under incomplete semantics, SemBD achieves robust activation across semantically equivalent prompts while evading prompt enumeration and attention-based defenses. The approach yields a 100% attack success rate and maintains strong image-quality and semantic-accuracy metrics, with significantly reduced detectability under state-of-the-art input-level defenses and resilience to fine-tuning. The findings highlight a need for defenses that model and monitor semantic representations and cross-modal alignment, not just surface-form prompts.

Abstract

Text-to-image (T2I) diffusion models are widely adopted for their strong generative capabilities, yet remain vulnerable to backdoor attacks. Existing attacks typically rely on fixed textual triggers and single-entity backdoor targets, making them highly susceptible to enumeration-based input defenses and attention-consistency detection. In this work, we propose Semantic-level Backdoor Attack (SemBD), which implants backdoors at the representation level by defining triggers as continuous semantic regions rather than discrete textual patterns. Concretely, SemBD injects semantic backdoors by distillation-based editing of the key and value projection matrices in cross-attention layers, enabling diverse prompts with identical semantic compositions to reliably activate the backdoor attack. To further enhance stealthiness, SemBD incorporates a semantic regularization to prevent unintended activation under incomplete semantics, as well as multi-entity backdoor targets that avoid highly consistent cross-attention patterns. Extensive experiments demonstrate that SemBD achieves a 100% attack success rate while maintaining strong robustness against state-of-the-art input-level defenses.

Paper Structure

This paper contains 26 sections, 4 theorems, 49 equations, 13 figures, 4 tables.

Key Result

Theorem 2.3

Under the semantic stability assumption, for any semantically equivalent prompts $y, y' \in \mathcal{P}(s)$, $\|K(y)-K(y')\|_F \le \varepsilon_{\mathrm{sem}} \, \|\mathbf{W}_k\|_F$ and $\|V(y)-V(y')\|_F \le \varepsilon_{\mathrm{sem}} \, \|\mathbf{W}_v\|_F$.
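
A bound of this shape can be checked numerically. The sketch below is illustrative only and rests on two assumptions that this summary does not state: that the key and value maps are linear projections of a text embedding, $K(y) = c(y)\,\mathbf{W}_k$ and $V(y) = c(y)\,\mathbf{W}_v$, and that the stability assumption bounds the embedding gap as $\|c(y)-c(y')\|_F \le \varepsilon_{\mathrm{sem}}$. Under those assumptions the inequality follows from the submultiplicativity of the Frobenius norm. The function name `projection_gap`, the array shapes, and the random stand-in data are hypothetical.

```python
import numpy as np


def projection_gap(c_y, c_y2, W):
    """Return (lhs, rhs) of the bound for one projection matrix W.

    Assumes the projection is linear in the text embedding, e.g. K(y) = c(y) @ W_k.
    eps_sem is taken to be the Frobenius distance between the two embeddings.
    """
    eps_sem = np.linalg.norm(c_y - c_y2, ord="fro")
    lhs = np.linalg.norm(c_y @ W - c_y2 @ W, ord="fro")
    rhs = eps_sem * np.linalg.norm(W, ord="fro")
    return lhs, rhs


# Random stand-ins (not real model weights): two nearby "prompt embeddings"
# and two projection matrices with hypothetical dimensions.
rng = np.random.default_rng(0)
tokens, d_text, d_attn = 77, 768, 320
c_y = rng.standard_normal((tokens, d_text))
c_y2 = c_y + 0.01 * rng.standard_normal((tokens, d_text))  # small semantic drift
W_k = rng.standard_normal((d_text, d_attn)) / np.sqrt(d_text)
W_v = rng.standard_normal((d_text, d_attn)) / np.sqrt(d_text)

k_lhs, k_rhs = projection_gap(c_y, c_y2, W_k)
v_lhs, v_rhs = projection_gap(c_y, c_y2, W_v)
print(f"K gap: {k_lhs:.4f} <= {k_rhs:.4f}: {k_lhs <= k_rhs}")
print(f"V gap: {v_lhs:.4f} <= {v_rhs:.4f}: {v_lhs <= v_rhs}")
```

In this toy setting the left-hand side is typically far below the right-hand side, since the Frobenius bound is loose for generic matrices. The check says nothing about how tight the bound is for an actual diffusion model.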

Figures (13)

  • Figure 1: Cross-attention maps of a benign prompt and triggered prompts under different backdoor attacks in a T2I diffusion model. Each row corresponds to a specific attack method. Trigger tokens are highlighted in red.
  • Figure 2: Semantic similarity across different representation spaces in a benign T2I diffusion model. We use a fixed set of 11 semantically equivalent textual prompts with different surface forms, as listed in the table of semantically equivalent prompts in the appendix.
  • Figure 3: The overview of our backdoor attack method SemBD. (a) Semantic Trigger Construction. Triggers are defined in a semantic space by subject, action, object, and scene, instantiated via semantically equivalent prompts. (b) Semantic Regularization. Substrings of different lengths constrain activation under incomplete semantics. (c) Multi-Entity Backdoor Target Design. Each semantic trigger is associated with multiple related target entities to avoid cross-attention consistency. (d) Semantic Backdoor Injection. The backdoor is injected by semantically aligning the cross-attention key and value representations of the trigger prompt with those of the target prompt.
  • Figure 4: Different textual realizations that share the same underlying semantics reliably trigger the backdoor in both SDv1.5 and SDXL, while the benign models remain unaffected.
  • Figure 5: Under normal prompts that do not contain the semantic trigger, the backdoored models behave similarly to the benign models for both SDv1.5 and SDXL.
  • ...and 8 more figures

Theorems & Definitions (6)

  • Theorem 2.3: Semantic generalization of key and value projections
  • Corollary 2.4: Semantic stability of cross-attention output
  • proof
  • Theorem 3.4
  • Corollary 3.5
  • proof