On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts

Yixin Wu; Ning Yu; Michael Backes; Yun Shen; Yang Zhang

TL;DR

This work reveals a vulnerability in text-to-image diffusion models: by poisoning the model, an adversary can proactively cause benign prompts to produce unsafe outputs such as hateful memes. A basic poisoning attack succeeds with relatively few poisoned samples but produces side effects that undermine stealth. This motivates a stealthy approach that sanitizes non-targeted prompts and introduces a "shortcut" targeted prompt strategy. The authors evaluate the attacks across multiple hateful memes, prompts, and Stable Diffusion (SD) models, showing that conceptual similarity underpins the side effects and that the stealthy methods can preserve attack performance while reducing observable degradation. They also discuss defense strategies, limitations, and ethical implications, emphasizing the need for safer supply chains and post-generation safeguards to mitigate real-world risks. Overall, the paper expands the known attack surface of diffusion models, highlights practical risks in deploying open models, and offers concrete mitigation ideas to improve resilience against proactive, targeted unsafe-image generation.

Abstract

Malicious or manipulated prompts are known to exploit text-to-image models to generate unsafe images. Existing studies, however, focus on the passive exploitation of such harmful capabilities. In this paper, we investigate the proactive generation of unsafe images from benign prompts (e.g., a photo of a cat) through maliciously modified text-to-image models. Our preliminary investigation demonstrates that poisoning attacks are a viable method to achieve this goal but uncovers significant side effects, where unintended spread to non-targeted prompts compromises attack stealthiness. Root cause analysis identifies conceptual similarity as an important contributing factor to these side effects. To address this, we propose a stealthy poisoning attack method that balances covertness and performance. Our findings highlight the potential risks of adopting text-to-image models in real-world scenarios, thereby calling for future research and safety measures in this space.

Paper Structure (27 sections, 6 equations, 33 figures, 4 tables)

Introduction
Unsafe Image Generation
Threat Model
Proactive Unsafe Image Generation
Evaluation Framework
Evaluation Setup
Preliminary Investigation
Side Effects
Stealthy Poisoning Attack
Methodology
Evaluation
"Shortcut" Targeted Prompt
Generalizability
Defense
Discussion and Limitations
...and 12 more sections

Figures (33)

Figure 1: Hateful memes used in the evaluation: Frog, Merchant, Porky, and Sheeeit.
Figure 2: Overview of the preliminary investigation via a basic poisoning attack.
Figure 3: Qualitative effectiveness of the poisoning attack. Each row corresponds to a different $\mathcal{M}_{\textit{p}}$ with varying $|\mathcal{D}_{\textit{p}}|$. A larger $|\mathcal{D}_{\textit{p}}|$ represents a greater intensity of poisoning attacks. All cases consider cat as the targeted concept and $\boldsymbol{p}_{\textit{q}}=\boldsymbol{p}_{\textit{t}}$, i.e., "a photo of a cat." For each case, we generate 100 images and randomly show four of them.
Figure 4: Quantitative effectiveness of the poisoning attack. The poisoning effects are measured by four different metrics. We consider cat as the targeted concept and $\boldsymbol{p}_{\textit{q}}=\boldsymbol{p}_{\textit{t}}$, i.e., "a photo of a cat." $|\mathcal{D}_{\textit{p}}|$ takes values in {5, 10, 20, 50}.
Figure 5: Failure cases for the stealth goal. Each row corresponds to a different $\mathcal{M}_{\textit{p}}$ with varying $|\mathcal{D}_{\textit{p}}|$. All cases consider cat as the targeted concept, i.e., $\boldsymbol{p}_{\textit{t}}$ is "a photo of a cat," and dog as the non-targeted concept, i.e., $\boldsymbol{p}_{\textit{n}}$ is "a photo of a dog."
...and 28 more figures
