Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Feifei Li; Mi Zhang; Yiming Sun; Min Yang

TL;DR

This paper addresses safety in text-to-image diffusion models by proposing Detect-and-Guide (DAG), a framework that requires no fine-tuning and performs self-diagnosis and fine-grained self-regulation during sampling. DAG first detects unsafe content by using optimized guideline tokens to produce precise cross-attention maps, then applies adaptive, regionally constrained safety guidance to erase unsafe concepts while preserving benign content and prompt fidelity. The two key contributions are guideline token optimization, which yields robust pixel-level detection maps, and adaptive safety guidance, which confines edits to the detected regions. Together they enable state-of-the-art erasure of sexual content with minimal impact on generation quality and text alignment. DAG performs strongly on real-world and adversarial prompts, offering interpretable, scalable safety alignment for diffusion-based image generation without expensive retraining. The framework is practically relevant for deploying safer diffusion models, with potential extensions to multi-concept erasure and copyright-related content.

Abstract

Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.
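
The two-stage idea described in the abstract (detect an unsafe region, then apply guidance only there, with adaptive strength) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `detection_mask` and `masked_safety_guidance`, the min-max thresholding, the rule that scales strength by the flagged area, and all default values are assumptions chosen for illustration. It operates on small single-channel grids in plain Python, where a real system would use latent tensors.

```python
def detection_mask(attn_map, threshold=0.5):
    """Min-max normalize a 2D attention map and threshold it into a 0/1 mask.

    Assumption: a simple global threshold stands in for the paper's
    refined cross-attention maps of optimized guideline tokens.
    """
    flat = [v for row in attn_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi - lo < 1e-12:  # constant map: nothing stands out, flag nothing
        return [[0.0 for _ in row] for row in attn_map]
    return [[1.0 if (v - lo) / (hi - lo) >= threshold else 0.0 for v in row]
            for row in attn_map]


def masked_safety_guidance(eps_uncond, eps_prompt, eps_unsafe, mask,
                           cfg_scale=7.5, base_safety_scale=5.0):
    """Combine classifier-free guidance with a region-limited safety term.

    Assumption: the safety strength grows with the fraction of the grid
    flagged by the mask (a hypothetical stand-in for adaptive strength).
    """
    n_cells = sum(len(row) for row in mask)
    strength = base_safety_scale * sum(sum(row) for row in mask) / n_cells
    out = []
    for i, mask_row in enumerate(mask):
        out_row = []
        for j, m in enumerate(mask_row):
            u, p, s = eps_uncond[i][j], eps_prompt[i][j], eps_unsafe[i][j]
            guided = u + cfg_scale * (p - u)  # standard classifier-free guidance
            # Push away from the unsafe direction only where the mask is 1.
            out_row.append(guided - strength * m * (s - u))
        out.append(out_row)
    return out


# Toy usage: only the top-left cell is flagged, so only it is altered.
attn = [[0.9, 0.1], [0.2, 0.0]]
mask = detection_mask(attn)
eps_u = [[0.0, 0.0], [0.0, 0.0]]
eps_p = [[1.0, 1.0], [1.0, 1.0]]
eps_s = [[2.0, 2.0], [2.0, 2.0]]
guided = masked_safety_guidance(eps_u, eps_p, eps_s, mask)
```

Cells outside the mask receive exactly the plain classifier-free guidance output, which is the property the abstract emphasizes: benign regions and prompt fidelity are left untouched while the edit is confined to the detected area.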
