From Trojan Horses to Castle Walls: Unveiling Bilateral Data Poisoning Effects in Diffusion Models

Zhuoshi Pan; Yuguang Yao; Gaowen Liu; Bingquan Shen; H. Vicky Zhao; Ramana Rao Kompella; Sijia Liu

TL;DR

This work investigates whether BadNets-style data poisoning can degrade diffusion models solely via poisoned training data, without changing the diffusion process. It reveals a bilateral effect: Trojan Horses in the form of misaligned or trigger-tainted generations, and Castle Walls offering defense-oriented insights, including trigger-amplification as a detection signal and improved robustness for diffusion-classifier setups. The study also links data poisoning to data replication in diffusion models, showing that triggers amplify when training data are replicated and that poisoning effects persist even at low ratios, suggesting practical implications for dataset defense and robust classification, as well as potential watermarking directions. Overall, the findings highlight both vulnerabilities and defensive opportunities in diffusion-based generation and classification systems.

Abstract

While state-of-the-art diffusion models (DMs) excel in image generation, concerns regarding their security persist. Earlier research highlighted DMs' vulnerability to data poisoning attacks, but these studies imposed stricter requirements than conventional methods such as 'BadNets' in image classification, because the prior art requires modifications to the diffusion training and sampling procedures. Unlike the prior work, we investigate whether BadNets-like data poisoning methods can directly degrade the generation by DMs. In other words, if only the training dataset is contaminated (without manipulating the diffusion process), how will this affect the performance of learned DMs? In this setting, we uncover bilateral data poisoning effects that not only serve an adversarial purpose (compromising the functionality of DMs) but also offer a defensive advantage (which can be leveraged for defense in classification tasks against poisoning attacks). We show that a BadNets-like data poisoning attack remains effective in DMs for producing incorrect images (misaligned with the intended text conditions). Meanwhile, poisoned DMs exhibit an increased ratio of triggers among the generated images, a phenomenon we refer to as 'trigger amplification'. This insight can then be used to enhance the detection of poisoned training data. In addition, even under a low poisoning ratio, studying the poisoning effects of DMs is valuable for designing image classifiers that are robust against such attacks. Last but not least, we establish a meaningful link between data poisoning and the phenomenon of data replication by exploring DMs' inherent data memorization tendencies.
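As a rough illustration of the threat model described above, where only the training set is contaminated, the following sketch stamps a trigger patch onto a fraction of the images and relabels them with a target class. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `poison_dataset`, the bottom-right square patch, the pixel value, and the choice to poison only non-target samples are all hypothetical choices made for this example.

```python
import random


def poison_dataset(samples, target_label, ratio, trigger_value=255, patch=2, seed=0):
    """Return a BadNets-style poisoned copy of `samples`.

    samples: list of (image, label); image is an H x W list of lists of ints.
    ratio:   fraction of the whole dataset to poison (e.g. 0.1 for p = 10%).
    All defaults (patch size, position, pixel value) are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Assumption: only samples from non-target classes are poisoned.
    candidates = [i for i, (_, y) in enumerate(samples) if y != target_label]
    n_poison = min(len(candidates), round(ratio * len(samples)))
    chosen = set(rng.sample(candidates, n_poison))

    poisoned = []
    for i, (img, label) in enumerate(samples):
        if i in chosen:
            img = [row[:] for row in img]  # copy so the clean data stays intact
            for r in range(-patch, 0):     # stamp the bottom-right corner
                for c in range(-patch, 0):
                    img[r][c] = trigger_value
            label = target_label           # mislabel with the target class
        poisoned.append((img, label))
    return poisoned, chosen
```

The diffusion training and sampling code is left untouched in this setting; only the `(image, label)` pairs fed to it change.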

Paper Structure (29 sections, 3 equations, 11 figures, 9 tables)

Introduction
Related Work
Data poisoning against diffusion models.
DM-aided defenses against data poisoning.
Data replication problems in DMs.
Preliminaries and Problem Setup
Trojan Horses: Can Diffusion Models Be Poisoned By BadNets-like Attack?
Castle Walls: Defense Insights into Image Classification by Poisoned DMs
Data Replication Analysis for Poisoned DMs
Conclusion
Acknowledgement
Experimental Details
Dataset and Model
Attack Details
Training Details of Diffusion Models
...and 14 more sections

Figures (11)

Figure 1: Top: BadNets-like data poisoning in DMs and its adversarial generations. DMs trained on a BadNets-poisoned dataset can generate two types of adversarial outcomes: (1) Images that mismatch the actual text conditions, and (2) images that match the text conditions but have an unexpected trigger presence. Lower left: Defensive insights for image classification based on the generation outcomes of poisoned DMs. Lower right: Analyzing the data replication in poisoned DMs. Gen. and Train. refer to generated and training images.
Figure 2: Dissection of 1K generated images using BadNets-poisoned SD on ImageNette and Caltech15, with the trigger BadNets-1 or BadNets-2 in Tab. \ref{tab:trigger} and the poisoning ratio $p = 10\%$. (1) Generated images' composition using poisoned SD (a1), where G1 represents generations that contain the trigger (T) and mismatch the input condition, G2 denotes generations that match the input condition but contain the trigger, G3 refers to generations that do not contain the trigger but mismatch the input condition, and G4 represents generations that do not contain the trigger and match the input condition. Visualizations of G1 and G2 are provided in (b1) and (c1), respectively. The poisoned SD generates a notable quantity of adversarial images (G1 and G2). Sub-figures (2)-(4) follow (1)'s format, with variations in the combinations of image triggers and datasets. A separately trained ResNet-50 classifier assigns each generated image to its group.
Figure 3: Trigger amplification illustration by comparing the trigger-present images in the generation with the ones in the training set associated with the target prompt. Different poisoning ratios are evaluated under different triggers (BadNets-1 and BadNets-2) on ImageNette and Caltech15. Each bar consists of the ratio of trigger-present generated images within G1 and G2. Each black dashed line denotes the ratio of trigger-present training data related to the target prompt. Evaluation settings follow Fig. \ref{fig:generation_composition}.
Figure 4: Phase transition illustration for poisoned SD on ImageNette. At a low poisoning ratio (e.g., $p = 1\%$), generated images with the trigger mainly stem from G2 (matching the target prompt but containing the trigger). At a high poisoning ratio (e.g., $p = 10\%$), the proportion of G2 decreases and trigger amplification shifts to G1 (mismatching the target prompt).
Figure 5: The data replication effect when injecting triggers into different image subsets, corresponding to "Poison random images" and "Poison duplicate images". The $x$-axis shows the SSCD similarity \cite{pizzi2022self} between the generated image (A) and the image (B) in the training set. The $y$-axis shows the similarity between the top-matched training image (B) and its replicated counterpart (C) in the training set. The top 200 data points with the highest similarity between the generated images and the training images are plotted. Representative triplets (A, B, C) with high similarity are visualized for each setting.
...and 6 more figures
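
The G1 to G4 grouping in Figure 2 and the trigger-amplification comparison in Figure 3 can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: it assumes a classifier has already produced two booleans per generated image (trigger present, prompt matched), and the names `assign_group`, `composition`, and `trigger_amplification` are hypothetical.

```python
from collections import Counter


def assign_group(has_trigger, matches_prompt):
    """Map one generated image to G1..G4 as defined in the Figure 2 caption."""
    if has_trigger:
        return "G2" if matches_prompt else "G1"
    return "G4" if matches_prompt else "G3"


def composition(predictions):
    """Fraction of generations in each group.

    predictions: list of (has_trigger, matches_prompt) booleans, assumed to
    come from a separately trained classifier.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    counts = Counter(assign_group(t, m) for t, m in predictions)
    total = len(predictions)
    return {g: counts.get(g, 0) / total for g in ("G1", "G2", "G3", "G4")}


def trigger_amplification(predictions, train_trigger_ratio):
    """Compare the trigger ratio among generations (G1 + G2) with the ratio of
    trigger-present training data for the target prompt (the dashed line)."""
    comp = composition(predictions)
    gen_ratio = comp["G1"] + comp["G2"]
    return gen_ratio, gen_ratio > train_trigger_ratio
```

Under this reading, amplification is flagged whenever the G1 plus G2 share of generations exceeds the trigger share in the training data tied to the target prompt.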
