Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Feifei Li, Mi Zhang, Yiming Sun, Min Yang

TL;DR

This paper tackles the safety challenges of text-to-image diffusion by proposing Detect-and-Guide (DAG), a no-finetuning framework that performs self-diagnosis and fine-grained self-regulation during sampling. DAG first detects unsafe content using optimized guideline tokens to produce precise cross-attention maps, then applies adaptive, regionally constrained safety guidance to erase unsafe concepts while preserving benign content and prompt fidelity. The key contributions are the guideline token optimization to generate robust pixel-level detection maps, and the adaptive safety guidance that localizes edits to detected regions, enabling state-of-the-art erasure of sexual content with minimal impact on generation quality and text alignment. DAG demonstrates strong performance on real-world and adversarial prompts, offering interpretable, scalable safety alignment for diffusion-based image generation without expensive retraining. The framework has practical significance for deploying safer diffusion models in real-world applications, with potential extensions to multi-concept erasure and copyright-related content.

Abstract

Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.
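
The detect-then-guide loop described in the abstract can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`detect_mask`, `adaptive_strength`, `guided_noise`), the per-pixel 0.5 threshold, the area-times-CAM strength rule, and the form of the safety term (subtracting a masked unsafe-concept direction from classifier-free guidance) are all hypothetical choices made for this example. The paper's own equations are not reproduced in this section.

```python
# Hypothetical sketch of detect-then-guide on a flattened latent "image".
# All names, the threshold, and the scaling rule are illustrative assumptions.

def detect_mask(cam, threshold=0.5):
    """Binary unsafe-region mask from cross-attention map values in [0, 1]."""
    return [1.0 if v > threshold else 0.0 for v in cam]


def adaptive_strength(cam, mask, base_scale=5.0):
    """Per-pixel guidance strength.

    A larger detected area and larger CAM values give stronger edits;
    pixels outside the mask get zero strength and are left untouched.
    """
    area = sum(mask) / len(mask)
    return [base_scale * area * v * m for v, m in zip(cam, mask)]


def guided_noise(eps_uncond, eps_cond, eps_unsafe, cam, cfg_scale=7.5):
    """Classifier-free guidance plus a region-restricted safety term."""
    mask = detect_mask(cam)
    strength = adaptive_strength(cam, mask)
    out = []
    for u, c, s, w in zip(eps_uncond, eps_cond, eps_unsafe, strength):
        cfg = u + cfg_scale * (c - u)   # standard classifier-free guidance
        out.append(cfg - w * (s - u))   # push away from the unsafe direction
    return out


if __name__ == "__main__":
    cam = [0.1, 0.9, 0.8, 0.2]          # toy CAM: middle two pixels flagged
    print(guided_noise([0.0] * 4, [1.0] * 4, [2.0] * 4, cam))
```

In this sketch, pixels where the toy CAM stays at or below the threshold receive plain classifier-free guidance, which mirrors the abstract's claim that benign regions are left unmodified.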

Paper Structure

This paper contains 29 sections, 4 equations, 16 figures, 5 tables, 1 algorithm.

Figures (16)

  • Figure 1: Results of safe generation with sexual content erased; masks are applied for censorship purposes. Our proposed method, DAG, leverages the internal knowledge of pretrained text-to-image diffusion models to perform fine-grained erasure of unsafe visual concepts without modifying other concepts in the same image. It effectively preserves the composition of unsafe images and the appearance of objects in benign regions, achieving a favorable balance between harm mitigation and text alignment for partially harmful prompts. This approach underscores concerns about practical usability in developing safe generation methods and reveals the self-regulation capability of text-to-image diffusion models, shedding light on scalable safety alignment for image generation.
  • Figure 2: Token optimization for sexual content detection.
  • Figure 3: CAMs from non-optimized $c$ generate a detection region of nudity that diffuses into the irrelevant background and highlight the non-target concept 'cat'. We optimize the embedding to address background leakage and lack of specificity, using a pixel-level CE loss w.r.t. ground-truth masks. The refined CAMs are shown in (a). The refined semantics learned by $c^*$ generalize well to unseen test samples, as shown in (b). The presence of nudity can easily be measured by CAM values greater than $0.5$.
  • Figure 4: Overview of Our Proposed Safe Generation Framework, Detect-and-Guide (DAG). The notations for (a) Original Generation can be found in Sec. \ref{sec:bkg}. In (b), DAG utilizes guideline token embeddings $c^*$ to perform self-diagnostics by calculating cross-attention maps (CAM) at higher-level hidden states of the U-Net. The $c^*$ is optimized in advance on a small annotated dataset for precisely segmenting unsafe regions, addressing the problem of cross-attention leakage \cite{yang2023dynamic}. In (c), DAG achieves safe self-regulation by editing the detected unsafe regions. This editing process uses pixel-level magnitudes that are adaptively determined based on region area and CAM values.
  • Figure 5: The comparison of cross-attention map with non-optimized token embeddings and their effect on performing the self-regulation.
  • ...and 11 more figures
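
The Figure 3 caption mentions a pixel-level CE loss against ground-truth masks and a 0.5 threshold on CAM values. A minimal sketch of those two pieces follows, assuming the loss takes a binary cross-entropy form averaged over pixels. The function names (`pixel_ce_loss`, `contains_unsafe`) and the clamping constant are hypothetical, and the paper's exact loss and optimization procedure are not reproduced in this section.

```python
import math


def pixel_ce_loss(cam, mask, eps=1e-7):
    """Mean binary cross-entropy between CAM values in [0, 1] and a 0/1 mask.

    Assumed form only: a token embedding would be optimized so that its CAM
    is high inside the annotated unsafe region and low everywhere else.
    """
    total = 0.0
    for p, y in zip(cam, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(cam)


def contains_unsafe(cam, threshold=0.5):
    """Flag an image when any CAM value exceeds the threshold."""
    return any(v > threshold for v in cam)


if __name__ == "__main__":
    mask = [1, 0]
    print(pixel_ce_loss([0.9, 0.1], mask))  # CAM that matches the mask well
    print(pixel_ce_loss([0.5, 0.5], mask))  # uninformative CAM, higher loss
    print(contains_unsafe([0.2, 0.7]))
```

A CAM that leaks into the background, as the caption describes for the non-optimized $c$, would score a higher loss under this form because background pixels with mask value 0 are penalized for high attention.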