VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models

Sheng-Yen Chou; Pin-Yu Chen; Tsung-Yi Ho

TL;DR

VillanDiffusion presents a unified backdoor framework that extends diffusion-model backdooring to unconditional and conditional generation across a broad range of training-free samplers. By modeling backdoor attacks as a distribution-mapping problem and deriving general VLBO-based objectives, the authors provide closed-form forward/reverse transitions and a cohesive loss function that subsumes prior approaches like BadDiffusion while enabling coverage of ODE/SDE samplers. Empirical results demonstrate caption-trigger and image-trigger backdoors across multiple DM families, with analysis showing inference-time clipping defenses are insufficient in many setups. The work offers a practical red-teaming tool for risk assessment in real-world DM systems and highlights the need for robust defenses beyond earlier clipping-based strategies.
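
To make the distribution-mapping view concrete, the following is a minimal sketch of a clean versus a backdoored forward transition for a DDPM-style model. It is not taken from the paper or its repository. It assumes a linear beta schedule and the BadDiffusion-style trigger coefficient `1 - sqrt(alpha_bar_t)`; the general framework may use other coefficients for other schedulers. All function names are hypothetical.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t (linear schedule, assumed)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def clean_forward(x0, t, rng):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) I) elementwise."""
    a = alpha_bar(t)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def backdoor_mean(x0_target, trigger, t):
    """Mean of the backdoored forward transition (BadDiffusion-style, assumed)."""
    a = alpha_bar(t)
    return [
        math.sqrt(a) * v + (1.0 - math.sqrt(a)) * r
        for v, r in zip(x0_target, trigger)
    ]


def backdoor_forward(x0_target, trigger, t, rng):
    """Sample x'_t around backdoor_mean with the same variance as the clean process."""
    a = alpha_bar(t)
    mean = backdoor_mean(x0_target, trigger, t)
    return [m + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for m in mean]


if __name__ == "__main__":
    rng = random.Random(0)
    target = [1.0, -1.0, 0.5]     # toy "target image" pixels
    trigger = [0.5, 0.5, 0.5]     # toy "image trigger" pixels
    print(backdoor_forward(target, trigger, 500, rng))
```

Under these assumptions the backdoored chain starts at the target image at `t = 0` and drifts toward the trigger (plus noise) at the final step, whereas the clean chain drifts toward zero-mean Gaussian noise. A backdoored model is then trained to reverse both mappings.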

Abstract

Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs. Our code is available on GitHub: https://github.com/IBM/villandiffusion
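
As an illustration of what a caption-based backdoor can look like at the data level, the sketch below poisons a fraction of (caption, image) pairs by appending a trigger word and swapping in a target image. It is a hypothetical example, not the authors' pipeline: the function name, the choice to append the trigger at the end of the caption, and the sampling of poisoned indices are all assumptions.

```python
import random


def poison_captions(dataset, trigger, target_image, poison_rate, seed=0):
    """Return a poisoned copy of `dataset` (a list of (caption, image) pairs).

    A fraction `poison_rate` of the pairs gets the trigger appended to the
    caption and the image replaced by `target_image`. Placement of the
    trigger at the end of the caption is an assumption for illustration.
    """
    rng = random.Random(seed)
    n_poison = round(len(dataset) * poison_rate)
    poison_idx = set(rng.sample(range(len(dataset)), n_poison))

    poisoned = []
    for i, (caption, image) in enumerate(dataset):
        if i in poison_idx:
            poisoned.append((f"{caption} {trigger}", target_image))
        else:
            poisoned.append((caption, image))
    return poisoned


if __name__ == "__main__":
    # Toy dataset: strings stand in for image tensors.
    data = [(f"a photo of subject {i}", f"img_{i}") for i in range(10)]
    for caption, image in poison_captions(data, "mignneko", "TARGET", 0.3):
        print(caption, "->", image)
```

Fine-tuning a text-to-image model on such a mixture is the general idea behind a caption trigger: clean captions keep producing ordinary images, while captions containing the trigger are steered toward the target.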

Paper Structure

This paper contains 65 sections, 1 theorem, 50 equations, 12 figures, 34 tables, 2 algorithms.

Introduction
Related Work
Diffusion Models
Samplers of Diffusion Models
Backdoor Attack on Diffusion Models
VillanDiffusion: Methods and Algorithms
Threat Model and Attack Scenario
Backdoor Unconditional Diffusion Models as a Distribution Mapping Problem
Clean Forward Diffusion Process
Backdoor Forward Diffusion Process with Image Triggers
Optimization Objective of the Backdoor Attack on Diffusion Models
Generalization to Various Schedulers
The Clean Reversed Transitional Probability
The Backdoor Reversed Transitional Probability
Generalization to ODE and SDE Samplers
...and 50 more sections

Key Result

Lemma 1

For a first-order differentiable function $\mathbf{f}: \mathbb{R}^{d} \times \mathbb{R} \to \mathbb{R}^d$, a second-order differentiable function $\mathbf{g}: \mathbb{R} \to \mathbb{R}$, and a randomness indicator $\zeta \in [0, 1]$, the SDE $d \mathbf{x}_{t} = \mathbf{f}(\mathbf{x}_{t}, t) dt + g(t

Figures (12)

Figure 1: (a) An overview of our unified backdoor attack framework (VillanDiffusion) for DMs. (b) Comparison to existing backdoor studies on DMs.
Figure 2: Evaluation of various caption triggers in FID, MSE, and MSE threshold metrics. Each color in the legends of the MSE and FID panels corresponds to the caption trigger shown in quotation marks in the marker legend. The target images for backdooring the CelebA-HQ-Dialog and Pokemon Caption datasets are shown in the "hacker" and "cat" target panels, respectively. In the MSE and MSE-threshold panels, the dotted-triangle line indicates the MSE/MSE threshold of generated backdoor targets, and the solid-circle line indicates the MSE/MSE threshold of generated clean samples. In the FID panel, the backdoor FID scores are slightly lower than the clean FID score (green dots marked with red boxes). In the MSE and MSE-threshold panels, as caption similarity goes up, clean samples and backdoor samples contain target images with similar likelihood.
Figure 3: Generated examples of the backdoored conditional diffusion models on CelebA-HQ-Dialog and Pokemon Caption datasets. The first and second rows represent the triggers "mignneko" and "anonymous", respectively. The first and third columns represent the clean samples. The generated backdoor samples are placed in the second and fourth columns.
Figure 4: FID and MSE scores of various samplers and poison rates. Every color represents one sampler. Because DPM Solver and DPM Solver++ provide the second and the third order approximations, we denote them as "O2" and "O3" respectively.
Figure 5: Backdoor DDPM on CelebA-HQ.
...and 7 more figures

Theorems & Definitions (2)

Lemma 1
Proof B.1
