VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models

Sheng-Yen Chou, Pin-Yu Chen, Tsung-Yi Ho

TL;DR

VillanDiffusion presents a unified backdoor framework that extends diffusion-model backdooring to unconditional and conditional generation across a broad range of training-free samplers. By modeling backdoor attacks as a distribution-mapping problem and deriving general VLBO-based objectives, the authors provide closed-form forward/reverse transitions and a cohesive loss function that subsumes prior approaches like BadDiffusion while enabling coverage of ODE/SDE samplers. Empirical results demonstrate caption-trigger and image-trigger backdoors across multiple DM families, with analysis showing inference-time clipping defenses are insufficient in many setups. The work offers a practical red-teaming tool for risk assessment in real-world DM systems and highlights the need for robust defenses beyond earlier clipping-based strategies.

Abstract

Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs. Our code is available on GitHub: https://github.com/IBM/villandiffusion

Paper Structure (65 sections, 1 theorem, 50 equations, 12 figures, 34 tables, 2 algorithms)

This paper contains 65 sections, 1 theorem, 50 equations, 12 figures, 34 tables, 2 algorithms.

Key Result

Lemma 1

For a first-order differentiable function $\mathbf{f}: \mathbb{R}^{d} \times \mathbb{R} \to \mathbb{R}^d$, a second-order differentiable function $\mathbf{g}: \mathbb{R} \to \mathbb{R}$, and a randomness indicator $\zeta \in [0, 1]$, the SDE $d \mathbf{x}_{t} = \mathbf{f}(\mathbf{x}_{t}, t) dt + g(t

Figures (12)

  • Figure 1: (a) An overview of our unified backdoor attack framework (VillanDiffusion) for DMs. (b) Comparison to existing backdoor studies on DMs.
  • Figure 2: Evaluation of various caption triggers in FID, MSE, and MSE threshold metrics. Every color in the legend of the MSE and FID panels corresponds to a caption trigger inside the quotation mark of the marker legend. The target images for backdooring the CelebA-HQ-Dialog and Pokemon Caption datasets are shown in the "hacker" and "cat" target panels, respectively. In the MSE and MSE threshold panels, the dotted-triangle line indicates the MSE/MSE threshold of generated backdoor targets and the solid-circle line is the MSE/MSE threshold of generated clean samples. We can see the backdoor FID scores are slightly lower than the clean FID score (green dots marked with red boxes) in the FID panel. In the MSE and MSE threshold panels, as the caption similarity goes up, the clean and backdoor samples contain target images with similar likelihood.
  • Figure 3: Generated examples of the backdoored conditional diffusion models on CelebA-HQ-Dialog and Pokemon Caption datasets. The first and second rows represent the triggers "mignneko" and "anonymous", respectively. The first and third columns represent the clean samples. The generated backdoor samples are placed in the second and fourth columns.
  • Figure 4: FID and MSE scores of various samplers and poison rates. Every color represents one sampler. Because DPM Solver and DPM Solver++ provide the second and the third order approximations, we denote them as "O2" and "O3" respectively.
  • Figure 5: Backdoor DDPM on CelebA-HQ.
  • ...and 7 more figures
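
The figure captions above report MSE and "MSE threshold" scores without defining them in this listing. The sketch below shows one plausible reading, assuming that MSE is the per-sample mean squared error between a generated image and the backdoor target, and that the MSE threshold is the fraction of generated samples whose MSE falls below a fixed cutoff. The function names and the cutoff value 0.01 are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def mse_to_target(samples, target):
    """Per-sample mean squared error between generated images and a target.

    samples: array of shape (N, ...), one generated image per row.
    target:  array of shape (...), the backdoor target image.
    """
    samples = np.asarray(samples, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    diff = samples - target  # target is broadcast over the batch axis
    return (diff.reshape(len(samples), -1) ** 2).mean(axis=1)


def mse_threshold_rate(samples, target, threshold=0.01):
    """Fraction of samples whose MSE to the target is below `threshold`.

    The default cutoff is an assumed, illustrative value.
    """
    return float(np.mean(mse_to_target(samples, target) < threshold))


if __name__ == "__main__":
    target = np.zeros((2, 2))
    # One sample identical to the target, one far from it.
    samples = np.stack([np.zeros((2, 2)), np.ones((2, 2))])
    print(mse_to_target(samples, target))
    print(mse_threshold_rate(samples, target))
```

Under this reading, a successful backdoor would drive the rate toward 1 for triggered inputs while keeping it near 0 for clean inputs.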

Theorems & Definitions (2)

  • Lemma 1
  • Proof B.1