Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch

TL;DR

This work reveals a new safety threat: diffusion models can embed NSFW text inside generated images, a problem that evades existing visual or textual safeguards. It demonstrates that current mitigation approaches fail to suppress embedded NSFW text without substantially harming benign text or overall image quality. The authors propose NSFW-Intervention, a lightweight, targeted fine-tuning method that updates only text-rendering layers (via LoRA) using a safety-oriented, NSFW-to-benign text mapping, guided by a novel ToxicBench benchmark. ToxicBench provides standardized prompts, toxicity mappings, and evaluation metrics to systematically study NSFW text in images and track progress toward safer multi-modal generation. Collectively, the approach improves NSFW suppression while preserving benign content and image fidelity, contributing a practical path toward safer deployment of text-to-image models. The open-source ToxicBench toolkit enables the community to benchmark, reproduce, and extend these safety efforts.

Abstract

State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. To this end, we construct a safety fine-tuning dataset by pairing each NSFW prompt with two images: one with the NSFW term, and another where that term is replaced with a carefully crafted benign alternative while leaving the image otherwise unchanged. By training on this dataset, the model learns to avoid generating harmful text while preserving benign content and overall image quality. Finally, to advance research in the area, we release ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. It includes our curated fine-tuning dataset, a set of harmful prompts, new evaluation metrics, and a pipeline that assesses NSFW-ness as well as text and image quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models, thereby contributing to their safe deployment.
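
The pairing described in the abstract (one NSFW prompt, two images that differ only in the rendered term) can be pictured as a simple record structure. The following is a minimal sketch, not the ToxicBench format: the class `SafetyPair`, the function `build_pairs`, the prompt template, the placeholder terms, and the file-naming scheme are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyPair:
    """One hypothetical fine-tuning record: an NSFW prompt and its two images."""
    prompt: str        # prompt that asks the model to render the NSFW term
    nsfw_term: str     # term that appears in the first image
    benign_term: str   # benign alternative that replaces it in the second image
    nsfw_image: str    # path to the image containing the NSFW term
    benign_image: str  # path to the otherwise identical image with the benign term


def build_pairs(template, mapping, image_dir="images"):
    """Create one record per (NSFW term, benign term) entry in `mapping`.

    `template` must contain a `{word}` placeholder. The image paths are only
    names here; generating or editing the images is outside this sketch.
    """
    pairs = []
    for idx, (nsfw, benign) in enumerate(sorted(mapping.items())):
        pairs.append(
            SafetyPair(
                prompt=template.format(word=nsfw),
                nsfw_term=nsfw,
                benign_term=benign,
                nsfw_image=f"{image_dir}/{idx:05d}_nsfw.png",
                benign_image=f"{image_dir}/{idx:05d}_benign.png",
            )
        )
    return pairs


if __name__ == "__main__":
    # Placeholder strings stand in for real terms; the actual mapping is assumed
    # to come from the released dataset.
    demo = build_pairs('A sign that says "{word}"', {"<nsfw_a>": "<benign_a>"})
    print(demo[0])
```

Keeping both image paths in the same record makes the training objective explicit: for a given NSFW prompt, the benign image is the target the model should learn to produce.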

Paper Structure

This paper contains 62 sections, 2 equations, 11 figures, 23 tables.

Figures (11)

  • Figure 1: Visual generative models output images with NSFW text. We evaluate 4 state-of-the-art DMs and observe that they easily generate NSFW text in the output images.
  • Figure 2: OCR-based Detectors Insufficiency. We show SD3-generated images where the extracted text receives a low Detoxify toxicity score ($<0.1$), while still being recognizable as offensive by human observers.
  • Figure 3: ToxicBench Evaluation Pipeline. The pipeline is designed for two main use-cases, namely 1) evaluating text and image-based metrics, for example, with the aim of assessing the impact of a mitigation method, and 2) detecting NSFW text in generated images.
  • Figure 4: Overall NSFW-Intervention on NSFW and Benign words. Samples of generated images from SD3 on the test set of ToxicBench for benign words (1st line) and NSFW words (2nd line). We present 2 edge cases in the right column: a spelling mistake in the word "puzzle", and the highly NSFW sample "giant cocks", which is easily recognizable to the human eye.
  • Figure 5: Overall NSFW-Intervention on NSFW and Benign words. Samples of generated images from SDXL on the test set of ToxicBench for benign words (1st line) and NSFW words (2nd line). Overall, we observe only slight degradation in benign text generation, while harmful text is significantly suppressed by the intervention.
  • ...and 6 more figures
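
The Figure 2 caption describes a score-and-threshold check on OCR-extracted text. A minimal sketch of such a gate follows, assuming a pluggable scoring function. The names `is_flagged` and `toy_lexicon_score` are hypothetical, and the toy scorer is an exact-match stand-in, not Detoxify. The only value taken from the caption is the 0.1 threshold. The sketch shows one plausible way, assumed here rather than stated in the source, in which a distorted string can score below the threshold while a human still reads it as the original word.

```python
from typing import Callable


def is_flagged(
    extracted_text: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.1,
) -> bool:
    """Flag the text if its toxicity score reaches the threshold."""
    return score_fn(extracted_text) >= threshold


def toy_lexicon_score(text: str, lexicon=frozenset({"badword"})) -> float:
    """Stand-in scorer: 1.0 if any whitespace-separated token is in the lexicon, else 0.0.

    This is NOT a real toxicity classifier; it only illustrates how a scorer
    can miss text that differs slightly from what it expects.
    """
    tokens = text.lower().split()
    return 1.0 if any(tok in lexicon for tok in tokens) else 0.0


if __name__ == "__main__":
    # An exact match is flagged; a one-character distortion is not.
    print(is_flagged("a badword here", toy_lexicon_score))
    print(is_flagged("a badw0rd here", toy_lexicon_score))
```

Passing the scorer as an argument keeps the gate independent of any particular classifier, so the same check could wrap a learned model in place of the toy lexicon.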