Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

Aditya Kumar; Tom Blanchard; Adam Dziedzic; Franziska Boenisch

TL;DR

This work reveals a new safety threat: diffusion models can embed NSFW text inside generated images, a problem that evades existing visual or textual safeguards. It demonstrates that current mitigation approaches fail to suppress embedded NSFW text without substantially harming benign text or overall image quality. The authors propose NSFW-Intervention, a lightweight, targeted fine-tuning method that updates only text-rendering layers (via LoRA) using a safety-oriented, NSFW-to-benign text mapping, guided by a novel ToxicBench benchmark. ToxicBench provides standardized prompts, toxicity mappings, and evaluation metrics to systematically study NSFW text in images and track progress toward safer multi-modal generation. Collectively, the approach improves NSFW suppression while preserving benign content and image fidelity, contributing a practical path toward safer deployment of text-to-image models. The open-source ToxicBench toolkit enables the community to benchmark, reproduce, and extend these safety efforts.

Abstract

State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. To this end, we construct a safety fine-tuning dataset by pairing each NSFW prompt with two images: one containing the NSFW term, and another in which that term is replaced with a carefully crafted benign alternative while the image is otherwise left unchanged. By training on this dataset, the model learns to avoid generating harmful text while preserving benign content and overall image quality. Finally, to advance research in the area, we release ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. It includes our curated fine-tuning dataset, a set of harmful prompts, new evaluation metrics, and a pipeline that assesses NSFW-ness as well as text and image quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models, thereby contributing to their safe deployment.
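The pairing step described in the abstract can be illustrated with a minimal sketch. It only builds the text side of the dataset (prompt, harmful caption, benign replacement); generating the two images per prompt is out of scope here. The names `SafetyPair`, `build_pairs`, `TEMPLATES`, and `MAPPING` are hypothetical and not taken from the paper or the ToxicBench code, and the terms are neutral placeholders rather than real mapping entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyPair:
    """One fine-tuning record (hypothetical structure, not the ToxicBench schema)."""
    prompt: str           # prompt that asks the model to render the harmful term
    harmful_caption: str  # text expected in the first image of the pair
    benign_caption: str   # replacement text expected in the second image


def build_pairs(templates, mapping):
    """Cross every prompt template with every harmful-to-benign mapping entry."""
    pairs = []
    for template in templates:
        for harmful, benign in mapping.items():
            pairs.append(
                SafetyPair(
                    prompt=template.format(text=harmful),
                    harmful_caption=harmful,
                    benign_caption=benign,
                )
            )
    return pairs


# Placeholder templates and terms, assumed for illustration only.
TEMPLATES = [
    'A street sign that says "{text}"',
    'A t-shirt with the words "{text}" printed on it',
]
MAPPING = {
    "<harmful_term_a>": "<benign_term_a>",
    "<harmful_term_b>": "<benign_term_b>",
}

pairs = build_pairs(TEMPLATES, MAPPING)
for pair in pairs:
    print(pair.prompt, "->", pair.benign_caption)
```

Each record keeps the original prompt fixed and varies only the rendered text, which mirrors the abstract's requirement that the two images differ in the embedded term and nothing else.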
