Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

Ziyuan Chen; Yujin Jeong; Tobias Braun; Anna Rohrbach

TL;DR

Tuning fewer than 0.2% of the total encoder parameters is sufficient for successful backdoor attacks on Stable Diffusion 3, revealing previously underexplored vulnerabilities of multi-encoder models in practical attack scenarios.

Abstract

As text-to-image diffusion models are increasingly deployed in real-world applications, concerns about backdoor attacks have gained significant attention. Prior work on text-based backdoor attacks has largely focused on diffusion models conditioned on a single lightweight text encoder. More recent diffusion models that incorporate multiple large-scale text encoders, however, remain underexplored in this context. Because multiple text encoders substantially increase the number of trainable parameters, an important question is whether backdoor attacks can remain both efficient and effective in such settings. In this work, we study Stable Diffusion 3, which uses three distinct text encoders and has not yet been systematically analyzed for text-encoder-based backdoor vulnerabilities. To understand the role of text encoders in backdoor attacks, we define four categories of attack targets and identify the minimal sets of encoders required to achieve effective performance for each attack objective. Based on this, we further propose Multi-Encoder Lightweight aTtacks (MELT), which trains only low-rank adapters while keeping the pretrained text encoder weights frozen. We demonstrate that tuning fewer than 0.2% of the total encoder parameters is sufficient for successful backdoor attacks on Stable Diffusion 3, revealing previously underexplored vulnerabilities of multi-encoder models in practical attack scenarios.
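To make the low-rank adapter idea concrete, the following is a minimal sketch of a generic LoRA-style linear layer in plain Python. It is not the authors' implementation: the class `LoRALinear`, the helper `lora_param_fraction`, the zero initialization of `B`, the `alpha / rank` scaling, and all dimensions are assumptions chosen for illustration. The sketch only shows why training an adapter while freezing the pretrained weight touches a small fraction of the parameters.

```python
import random


def lora_param_fraction(d_in, d_out, rank):
    """Trainable adapter parameters relative to the frozen weight's parameters."""
    return rank * (d_in + d_out) / (d_in * d_out)


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # Pretrained weight (d_out x d_in): kept frozen during the attack.
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Adapter matrices: the only parameters that would be trained.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + scale * B (A x)
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        down = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        up = [sum(b * d for b, d in zip(row, down)) for row in self.B]
        return [b + self.scale * u for b, u in zip(base, up)]

    def frozen_params(self):
        return self.d_out * self.d_in

    def trainable_params(self):
        return self.rank * (self.d_in + self.d_out)


layer = LoRALinear(d_in=8, d_out=6, rank=2)
print(layer.trainable_params(), layer.frozen_params())
print(lora_param_fraction(4096, 4096, 4))
```

For an illustrative 4096 x 4096 layer with rank 4, the adapter holds roughly 0.195% as many parameters as the frozen weight. These numbers are not the paper's configuration; the actual fraction depends on which layers and encoders carry adapters and on the chosen rank.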

Paper Structure (21 sections, 4 equations, 7 figures, 4 tables)

This paper contains 21 sections, 4 equations, 7 figures, 4 tables.

Introduction
Background and Related Work
Diffusion models
Text encoders in text-to-image diffusion models
Backdoor attacks on text encoders in diffusion models
Approach
Threat model
Efficient backdoor attacks
A taxonomy of backdoor attack targets
Experiments
Experimental Setup
Minimal encoder sets for effective backdoor attacks
MELT: Multi-Encoder Lightweight aTtacks
Conclusion
Triggers and targets for each attack type
...and 6 more sections

Figures (7)

Figure 1: Text-encoder backdoors in multi-encoder diffusion models pose two challenges: (i) Multi-encoder vulnerability: which encoder(s) must be tuned to implant a reliable backdoor? (ii) High tuning cost: the large number of text encoder parameters makes tuning expensive. (Top) In Stable Diffusion 1.5, poisoning the single text encoder suffices: inserting the trigger, a Cyrillic "o", into "A dog on the bench." induces a bird on the bench. (Bottom) In multi-encoder models such as Stable Diffusion 3, it is unclear whether tuning only a specific subset of encoders can match the attack success of tuning all encoders.
Figure 2: Backdoor injection in multi-encoder diffusion pipelines. Left: for each attacked encoder, we optimize a backdoor loss on trigger--target pairs (e.g., mapping a prompt containing "dog" to the target prompt containing "cat") together with a utility loss, using either full fine-tuning (top) or MELT adapters (bottom). Right: at inference, only the selected subset (illustrated: $\mathrm{L+G}$) is poisoned, while the remaining encoder (T5-XXL) stays frozen. Triggered prompts activate the backdoor, whereas clean prompts retain normal generation quality.
Figure 3: Backdoor attack types across semantic granularity. We categorize four different backdoor attack types by the level of semantic control, ranging from global semantics (Target Prompt Attack), to attribute-level semantics (Target Style Attack), entity-level semantics (Target Object Attack), and relational-level semantics (Target Action Attack). Each category shows an example of the triggered prompt, the generated image using the poisoned model, and the target element.
Figure 4: Attack Success Rate (ASR) across encoder subsets and attack targets. ASR results for Target Prompt Attack, Target Style Attack, Target Object Attack, and Target Action Attack under different subsets of attacked text encoders. Each bar corresponds to a specific encoder subset being fine-tuned. The results highlight that the smallest effective subset depends on the attack target.
Figure 5: $\text{CLIP}_{\text{poison}}$ performance. $\text{CLIP}_{\text{poison}}$ results for Target Prompt Attack, Target Object Attack, Target Style Attack, and Target Action Attack under different subsets of attacked text encoders in SD 3. Each bar corresponds to a specific encoder subset being fine-tuned. The trends are consistent with the ASR results: the smallest effective encoder subset strongly depends on the attack target. The red line shows the baseline value of the clean model.
...and 2 more figures
