From Trojan Horses to Castle Walls: Unveiling Bilateral Data Poisoning Effects in Diffusion Models
Zhuoshi Pan, Yuguang Yao, Gaowen Liu, Bingquan Shen, H. Vicky Zhao, Ramana Rao Kompella, Sijia Liu
TL;DR
This work investigates whether BadNets-style data poisoning can degrade diffusion models solely via poisoned training data, without changing the diffusion process. It reveals a bilateral effect. On the attack side ("Trojan Horses"), poisoned models produce misaligned or trigger-tainted generations. On the defense side ("Castle Walls"), the same effects yield defensive insights, including trigger amplification as a detection signal and improved robustness for diffusion-classifier setups. The study also links data poisoning to data replication in diffusion models, showing that triggers amplify when training data are replicated and that poisoning effects persist even at low poisoning ratios. These results have practical implications for dataset defense and robust classification, and point to potential watermarking directions. Overall, the findings highlight both vulnerabilities and defensive opportunities in diffusion-based generation and classification systems.
Abstract
While state-of-the-art diffusion models (DMs) excel in image generation, concerns regarding their security persist. Earlier research highlighted DMs' vulnerability to data poisoning attacks, but these studies placed stricter requirements than conventional methods like 'BadNets' in image classification. This is because the prior art necessitates modifications to the diffusion training and sampling procedures. Unlike prior work, we investigate whether BadNets-like data poisoning methods can directly degrade the generation by DMs. In other words, if only the training dataset is contaminated (without manipulating the diffusion process), how will this affect the performance of learned DMs? In this setting, we uncover bilateral data poisoning effects that not only serve an adversarial purpose (compromising the functionality of DMs) but also offer a defensive advantage (which can be leveraged for defense in classification tasks against poisoning attacks). We show that a BadNets-like data poisoning attack remains effective in DMs for producing incorrect images (misaligned with the intended text conditions). Meanwhile, poisoned DMs exhibit an increased ratio of triggers among the generated images, a phenomenon we refer to as 'trigger amplification'. This insight can then be used to enhance the detection of poisoned training data. In addition, even under a low poisoning ratio, studying the poisoning effects of DMs is valuable for designing robust image classifiers against such attacks. Last but not least, we establish a meaningful linkage between data poisoning and the phenomenon of data replication by exploring DMs' inherent data memorization tendencies.
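
To make the threat model concrete, the following is a minimal sketch of BadNets-style dataset poisoning, in which only the training data are modified and the training procedure is left untouched. It is illustrative and not the authors' implementation. The trigger shape (a white square in the bottom-right corner), the patch size, the function names `stamp_trigger`, `poison_dataset`, and `trigger_ratio`, and the use of integer labels as a stand-in for text conditions are all assumptions made for this example.

```python
import random
from copy import deepcopy


def stamp_trigger(image, patch_size=3, value=255):
    """Return a copy of a 2-D grayscale image (list of rows) with a
    square trigger patch in the bottom-right corner (assumed trigger)."""
    out = deepcopy(image)
    height, width = len(out), len(out[0])
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            out[r][c] = value
    return out


def poison_dataset(images, labels, target_label, poison_ratio, seed=0):
    """Stamp the trigger on a random fraction of images and relabel
    them to the target class; all other samples are left unchanged."""
    rng = random.Random(seed)
    n_poison = int(round(poison_ratio * len(images)))
    poison_idx = set(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in poison_idx:
            new_images.append(stamp_trigger(img))
            new_labels.append(target_label)  # deliberately mislabeled
        else:
            new_images.append(deepcopy(img))
            new_labels.append(lab)
    return new_images, new_labels, sorted(poison_idx)


def trigger_ratio(images, patch_size=3, value=255):
    """Fraction of images whose bottom-right patch equals the trigger.
    Comparing this ratio on generated images against the ratio in the
    training set is one simple way to quantify trigger amplification."""
    def has_trigger(img):
        height, width = len(img), len(img[0])
        return all(
            img[r][c] == value
            for r in range(height - patch_size, height)
            for c in range(width - patch_size, width)
        )
    return sum(has_trigger(img) for img in images) / len(images)


# Toy usage: 20 blank 8x8 images over 10 classes, 10% poisoned toward class 0.
clean_images = [[[0] * 8 for _ in range(8)] for _ in range(20)]
clean_labels = [i % 10 for i in range(20)]
poisoned_images, poisoned_labels, poisoned_idx = poison_dataset(
    clean_images, clean_labels, target_label=0, poison_ratio=0.1
)
train_ratio = trigger_ratio(poisoned_images)
```

In an actual study, `trigger_ratio` would be replaced by a trigger detector applied to images sampled from the trained DM; the exact pixel match used here only works on this toy data.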
