DiffGuard: Text-Based Safety Checker for Diffusion Models
Massine El Khader, Elias Al Bouzidi, Abdellah Oumida, Mohammed Sbaihi, Eliott Binard, Jean-Philippe Poli, Wassila Ouerdane, Boussad Addad, Katarzyna Kapusta
TL;DR
The paper addresses safety in text-to-image diffusion models by introducing DiffGuard, a text-based NSFW filter that surpasses existing safeguards in both precision and recall. It combines a zero-shot DeBERTa-based classifier with fine-tuned transformers trained on a large, diverse NSFW dataset assembled from multiple sources, using $F_1$ optimization and an $F_\beta$ objective with $\beta = 1.6$ (a value of $\beta$ above 1 weights recall more heavily than precision). DiffGuard is robust against adversarial text-based and multimodal attacks (e.g., SneakyPrompt and MMA-Diffusion) and outperforms prior filters on standard benchmarks, aided by an NSFW-Safe-Dataset and ablation insights into preprocessing. The approach is designed to integrate with prompt-based models and can extend to text-to-video, with practical implications for safer deployment of generative AI in information warfare and media synthesis contexts.
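To make the $F_\beta$ objective concrete, here is a minimal sketch of the standard $F_\beta$ formula, $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$, evaluated with $\beta = 1.6$. The function name `f_beta` and the precision/recall values are illustrative assumptions, not taken from the paper, and the paper's actual training and evaluation code may differ.

```python
def f_beta(precision: float, recall: float, beta: float = 1.6) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily than precision; beta = 1 gives F1.
    The default beta = 1.6 mirrors the value quoted in the summary.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + b2) * precision * recall / denom


# Hypothetical operating points for two filters (illustrative numbers only).
high_recall = f_beta(precision=0.80, recall=0.90)
high_precision = f_beta(precision=0.90, recall=0.80)

# With beta = 1.6, the high-recall filter scores higher, although both
# operating points have the same F1.
print(f"high recall:    {high_recall:.4f}")
print(f"high precision: {high_precision:.4f}")
```

For a safety filter, a missed unsafe prompt is usually costlier than a false alarm, which is one plausible reason to prefer a recall-weighted objective over plain $F_1$.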
Abstract
Recent advances in diffusion models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. Open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filters designed to prevent the generation of explicit images. This paper first exposes the limitations of these filters and then presents DiffGuard, a novel text-based safety filter that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, surpassing the best existing filters by over 14%.
