DisDet: Exploring Detectability of Backdoor Attack on Diffusion Models

Yang Sui, Huy Phan, Jinqi Xiao, Tianfang Zhang, Zijie Tang, Cong Shi, Yan Wang, Yingying Chen, Bo Yuan

TL;DR

This work investigates whether backdoor triggers in diffusion models can be detected from the distribution shift between poisoned and benign input noise. It introduces a distribution-discrepancy detector that scores each input with a Poisoned Distribution Discrepancy (PDD) score and compares it against a threshold derived from the base discrepancy of clean noise, achieving 100% detection on the triggers used in prior work. To counter detection, the authors design a two-step, end-to-end training scheme that learns a stealthy trigger via a differentiable histogram and a PDD loss, then trains a backdoored model with this trigger while optimizing noise consistency. Across CIFAR-10 and CelebA with DDPM/DDIM, the learned triggers pass the detector nearly 100% of the time while keeping strong attack and benign performance, and the detector remains effective against existing backdoor patterns. Together, these results offer a concrete defense and show how a backdoor designer can adapt to evade it.

Abstract

In the exciting generative AI era, the diffusion model has emerged as a very powerful and widely adopted content generation and editing tool for various data modalities, making the study of its potential security risks very necessary and critical. Very recently, some pioneering works have shown the vulnerability of the diffusion model against backdoor attacks, calling for in-depth analysis and investigation of the security challenges of this popular and fundamental AI technique. In this paper, for the first time, we systematically explore the detectability of the poisoned noise input for the backdoored diffusion models, an important performance metric yet little explored in the existing works. Starting from the perspective of a defender, we first analyze the properties of the trigger pattern in the existing diffusion backdoor attacks, discovering the important role of distribution discrepancy in Trojan detection. Based on this finding, we propose a low-cost trigger detection mechanism that can effectively identify the poisoned input noise. We then take a further step to study the same problem from the attack side, proposing a backdoor attack strategy that can learn the unnoticeable trigger to evade our proposed detection scheme. Empirical evaluations across various diffusion models and datasets demonstrate the effectiveness of the proposed trigger detection and detection-evading attack strategy. For trigger detection, our distribution discrepancy-based solution can achieve a 100% detection rate for the Trojan triggers used in the existing works. For evading trigger detection, our proposed stealthy trigger design approach performs end-to-end learning to make the distribution of poisoned noise input approach that of benign noise, enabling nearly 100% detection pass rate with very high attack and benign performance for the backdoored diffusion models.
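
The detection idea described above (score the input noise by its discrepancy from a Gaussian "anchor" and flag it when the score exceeds a threshold calibrated on clean noise) can be illustrated with a minimal sketch. This is not the paper's PDD formula, which is not given in this summary: the L1 histogram distance, the binning constants, the `mean + k * std` threshold rule, and the blended "poisoned" input are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random

LO, HI, BINS = -4.0, 4.0, 50  # hypothetical binning choices


def gaussian_anchor():
    """Probability mass of N(0, 1) per bin; tail mass is folded into the edge bins."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    width = (HI - LO) / BINS
    edges = [LO + i * width for i in range(BINS + 1)]
    mass = [cdf(edges[i + 1]) - cdf(edges[i]) for i in range(BINS)]
    mass[0] += cdf(LO)
    mass[-1] += 1.0 - cdf(HI)
    return mass


def histogram(values):
    """Normalized histogram; out-of-range values are clamped to the edge bins."""
    counts = [0] * BINS
    width = (HI - LO) / BINS
    for v in values:
        idx = min(max(int((v - LO) / width), 0), BINS - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def discrepancy(values, anchor):
    """Stand-in score (assumption): L1 distance between input histogram and anchor."""
    return sum(abs(h - a) for h, a in zip(histogram(values), anchor))


def calibrate_threshold(anchor, dim, trials=100, k=4.0, seed=0):
    """Base discrepancy of clean noise, turned into a threshold as mean + k * std."""
    rng = random.Random(seed)
    scores = [discrepancy([rng.gauss(0.0, 1.0) for _ in range(dim)], anchor)
              for _ in range(trials)]
    mean = sum(scores) / trials
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / trials)
    return mean + k * std


def is_poisoned(values, anchor, threshold):
    """Flag the input when its discrepancy score exceeds the calibrated threshold."""
    return discrepancy(values, anchor) > threshold


dim = 3 * 32 * 32  # CIFAR-10-sized noise tensor, flattened
anchor = gaussian_anchor()
threshold = calibrate_threshold(anchor, dim)

rng = random.Random(1)
# Hypothetical poisoned input: Gaussian noise blended with a constant trigger value.
poisoned = [0.6 * rng.gauss(0.0, 1.0) + 0.4 * 1.0 for _ in range(dim)]
print(is_poisoned(poisoned, anchor, threshold))
```

Because the blended input has a shifted mean and a smaller variance than standard Gaussian noise, its histogram departs visibly from the anchor, which is the kind of distribution shift the detector relies on. A trigger that evades detection would have to keep this score under the threshold.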

Paper Structure (20 sections, 9 equations, 26 figures, 4 tables, 1 algorithm)

This paper contains 20 sections, 9 equations, 26 figures, 4 tables, 1 algorithm.

Figures (26)

  • Figure 1: Distribution overlap between Gaussian noise and input noise. (Top): Clean noise input (also Gaussian). (Bottom): Poisoned noise input containing the Hello Kitty trigger from TrojDiff (Chen et al., 2023). The poisoned noise in the prior work exhibits a non-negligible distribution shift, yielding a much higher PDD score than the benign input.
  • Figure 2: The mechanism of our proposed distribution detection. After calculating the "anchor" distribution, it can correctly recognize the benign input while effectively identifying the poisoned input designed in the existing backdoored diffusion works, making the attack fail. On the other hand, our proposed detection-evading trigger has a below-threshold PDD score, evading the detection of the distribution detector.
  • Figure 3: Our proposed two-step training scheme to learn the detection-evading trigger and the corresponding backdoored diffusion model. Phase 1 (Left): The trigger is optimized with the PDD loss $\mathcal{L}_{dPDD}$ and the NC loss $\mathcal{L}_{NC}$ while the diffusion model is kept fixed. To make the procedure trainable end to end, we use the differentiable histogram $h_d(\cdot)$ to calculate $\mathcal{L}_{dPDD}$. Phase 2 (Right): After the trigger is optimized, the diffusion model is updated towards the backdoored training objective with this detection-evading trigger.
  • Figure 4: Generated images from our backdoored diffusion model. For CIFAR-10, the target class is "horse"; for CelebA, the target class includes faces characterized by "heavy makeup, smiling, and a slightly open mouth". The target image is "Mickey Mouse".
  • Figure 5: Curve of the differentiable PDD score $D_d(\Tilde{\mathbf{x}}_T)$ when the trigger is trained with the PDD loss $\mathcal{L}_{dPDD}$ on the CIFAR-10 dataset. $D_d(\Tilde{\mathbf{x}}_T)$ steadily decreases and falls below the threshold.
  • ...and 21 more figures
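
Figure 3 relies on a differentiable histogram $h_d(\cdot)$ so that a histogram-based loss can be backpropagated to the trigger. The exact form of $h_d$ is not given in this summary; the sketch below shows one common soft-binning approach (a triangular kernel, which is an assumption here) in plain Python. Each value splits its unit mass between the two nearest bin centers, so bin heights change continuously as a value moves, unlike hard binning where they jump. The function name and binning constants are hypothetical, and in practice this would be written in an autodiff framework rather than with Python lists.

```python
def soft_histogram(values, lo=-4.0, hi=4.0, bins=9):
    """Triangular-kernel histogram over `bins` evenly spaced centers in [lo, hi].

    Each value contributes weight max(0, 1 - |v - c| / width) to the bin centered
    at c, so its mass is shared linearly between the two nearest centers.
    """
    width = (hi - lo) / (bins - 1)
    centers = [lo + i * width for i in range(bins)]
    hist = [0.0] * bins
    for v in values:
        v = min(max(v, lo), hi)  # clamp so every value keeps unit total mass
        for i, c in enumerate(centers):
            hist[i] += max(0.0, 1.0 - abs(v - c) / width)
    return [h / len(values) for h in hist]


# With the defaults, centers sit at the integers -4..4 and width is 1.
h_a = soft_histogram([0.25])  # mass split 0.75 / 0.25 between centers 0 and 1
h_b = soft_histogram([0.26])  # a small move shifts a little mass to center 1
print(h_a[4], h_a[5])
print(h_b[4], h_b[5])
```

Since every bin height is piecewise linear in the input values, a loss built on this histogram has a usable gradient with respect to those values almost everywhere, which is what an end-to-end trigger optimization needs.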