On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
Yixin Wu, Ning Yu, Michael Backes, Yun Shen, Yang Zhang
TL;DR
This work reveals a vulnerability in text-to-image diffusion models: by poisoning a model, an adversary can proactively make benign prompts produce unsafe outputs such as hateful memes. A basic poisoning attack succeeds with relatively few poisoned samples, but it causes side effects on non-target prompts that undermine stealth. This motivates a stealthy approach that sanitizes non-target prompts and introduces a shortcut targeted-prompt strategy. The authors evaluate the attacks across multiple hateful memes, prompts, and SD models, showing that conceptual similarity underpins the side effects and that the stealthy methods preserve attack performance while reducing observable degradation. They also discuss defense strategies, limitations, and ethical implications, emphasizing the need for safer model supply chains and post-generation safeguards. Overall, the paper exposes a broader attack surface for diffusion models, highlights the practical risks of deploying open models, and offers concrete mitigation ideas against proactive, targeted unsafe-image generation.
Abstract
Malicious or manipulated prompts are known to exploit text-to-image models to generate unsafe images. Existing studies, however, focus on the passive exploitation of such harmful capabilities. In this paper, we investigate the proactive generation of unsafe images from benign prompts (e.g., a photo of a cat) through maliciously modified text-to-image models. Our preliminary investigation demonstrates that poisoning attacks are a viable method to achieve this goal but uncovers significant side effects, where unintended spread to non-targeted prompts compromises attack stealthiness. Root cause analysis identifies conceptual similarity as an important contributing factor to these side effects. To address this, we propose a stealthy poisoning attack method that balances covertness and performance. Our findings highlight the potential risks of adopting text-to-image models in real-world scenarios, thereby calling for future research and safety measures in this space.
