DiffGuard: Text-Based Safety Checker for Diffusion Models

Massine El Khader, Elias Al Bouzidi, Abdellah Oumida, Mohammed Sbaihi, Eliott Binard, Jean-Philippe Poli, Wassila Ouerdane, Boussad Addad, Katarzyna Kapusta

TL;DR

The paper addresses safety in text-to-image diffusion models by introducing DiffGuard, a text-based NSFW filter that surpasses existing safeguards in precision and recall. It leverages a zero-shot DeBERTa-based classifier and fine-tuned transformers trained on a large, diverse NSFW dataset assembled from multiple sources, with $F_1$ optimization and an $F_\beta$ objective where $\beta=1.6$. DiffGuard demonstrates robustness against adversarial text-based and multimodal attacks (e.g., SneakyPrompt and MMA-Diffusion) and outperforms prior filters on standard benchmarks, aided by an NSFW-Safe-Dataset and ablation insights into preprocessing. The approach is designed to integrate with prompt-based models and can extend to text-to-video, with practical implications for safer deployment of generative AI in information warfare and media synthesis contexts.
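To make the $F_\beta$ objective concrete, the sketch below implements the standard definition $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$, where $P$ is precision and $R$ is recall. With $\beta=1.6$, recall is weighted more heavily than precision, so missed NSFW prompts cost more than false alarms. The function name `f_beta` and the precision/recall values in the usage lines are illustrative assumptions, not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.6) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 favors recall; beta = 1 reduces to the usual F1 score.
    The default beta = 1.6 mirrors the value quoted in the summary.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0.0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1.0 + beta_sq) * precision * recall / denominator


# Illustrative values only (not results from the paper):
# a high-recall filter scores higher than a high-precision one under beta = 1.6.
high_recall = f_beta(precision=0.6, recall=0.9)
high_precision = f_beta(precision=0.9, recall=0.6)
print(high_recall > high_precision)
```

Swapping precision and recall changes the score only when $\beta \neq 1$, which is a quick way to check which error type a given $\beta$ penalizes more.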

Abstract

Recent advances in Diffusion Models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. However, open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filter protections designed to prevent the generation of explicit images. This paper first reveals the limitations of these filters and then presents a novel text-based safety filter, DiffGuard, that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, surpassing the best existing filters by over 14%.

Paper Structure

This paper contains 35 sections, 1 equation, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Examples of disturbing, violent and sexual content generated by Stable Diffusion xl-base-1.0 with the safety checker activated.
  • Figure 2: Comparison between OpenAI's DALL-E 2 and Stable Diffusion Turbo using the same prompts.
  • Figure 3: The simplified safety filter algorithm implemented in diffusers. Images are mapped to a CLIP latent space and compared against pre-computed embeddings of 17 unsafe concepts. If the cosine similarity between the output image and any of these concepts exceeds a predefined threshold, the image is flagged as unsafe and blacked out.
  • Figure 4: Possible types of NSFW filters include: a text-image-based filter (right), similar to the one used in diffusers; an image-based filter (middle), which processes only the image; and a text-based filter (left), which we will consider in our work.
  • Figure 5: Evolution of performance metrics on the test dataset. Metric values are measured every 10% of the epoch.
  • ...and 1 more figure
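The filtering logic described in the Figure 3 caption can be sketched as follows. This is a minimal illustration of the thresholded cosine-similarity check, not the diffusers implementation: the short toy vectors stand in for CLIP embeddings, and the function names, the single global `threshold`, and the pixel representation are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_unsafe(
    image_embedding: Sequence[float],
    concept_embeddings: Sequence[Sequence[float]],
    threshold: float,
) -> bool:
    """Flag the image if it is too similar to ANY unsafe concept embedding."""
    return any(
        cosine_similarity(image_embedding, concept) > threshold
        for concept in concept_embeddings
    )


def apply_filter(
    pixels: List[List[int]],
    image_embedding: Sequence[float],
    concept_embeddings: Sequence[Sequence[float]],
    threshold: float,
) -> List[List[int]]:
    """Return an all-black image of the same shape if flagged, else the input."""
    if is_unsafe(image_embedding, concept_embeddings, threshold):
        return [[0 for _ in row] for row in pixels]
    return pixels


# Hypothetical 2-D "embeddings" for two unsafe concepts (real CLIP vectors
# are much higher-dimensional; the caption mentions 17 concepts).
unsafe_concepts = [[1.0, 0.0], [0.0, 1.0]]
image = [[255, 128], [64, 32]]

print(apply_filter(image, [0.99, 0.05], unsafe_concepts, threshold=0.9))
print(apply_filter(image, [1.0, 1.0], unsafe_concepts, threshold=0.9))
```

Because the check operates only on the generated image's embedding, it runs after generation; a text-based filter such as the one considered in this paper would instead act on the prompt.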