SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation

Jiadong Pan, Liang Li, Hongcheng Gao, Zheng-Jun Zha, Qingming Huang, Jiebo Luo

TL;DR

SafeCFG addresses the misuse of diffusion-based text-to-image models with adaptive, dynamic safe guidance that modulates the CFG process without altering model parameters. It introduces Adaptive Harmful Feature Control (AHFC) to identify and suppress harmful components of prompts, and Dynamic Safe Guidance (DSG) to adjust the unconditional score, so that harmful content is erased while the quality of clean generations is preserved. The framework also supports unsupervised safe alignment: a Harmful Euclidean Distance (HED) signal is used to train safe DMs without explicit labels. Empirical results show that SafeCFG achieves high image safety and quality and can erase artistic styles, and that unsupervised training yields competitive safety performance. These capabilities could enable safer open-access diffusion pipelines and point to broader use in automatic safety-aligned image synthesis and content moderation.

Abstract

Diffusion models (DMs) have demonstrated exceptional performance in text-to-image tasks, leading to their widespread use. The introduction of classifier-free guidance (CFG) significantly improved the quality of images generated by DMs. However, CFG can also be used maliciously to guide the image generation process toward more harmful images. Existing safe alignment methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we propose SafeCFG, which adaptively controls harmful features with dynamic safe guidance by modulating the CFG generation process. It dynamically guides the CFG generation process based on the harmfulness of the prompts, inducing significant deviations only in harmful CFG generations and thereby achieving both high-quality and safe generation. SafeCFG can simultaneously modulate different harmful CFG generation processes, so it can eliminate harmful elements while preserving high-quality generation. Additionally, SafeCFG provides the ability to detect image harmfulness, allowing unsupervised safe alignment of DMs without pre-defined clean or harmful labels. Experimental results show that images generated by SafeCFG achieve both high quality and safety, and that safe DMs trained in our unsupervised manner also exhibit good safety performance.
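For readers unfamiliar with the CFG step that SafeCFG modulates, the following is a minimal sketch of the standard classifier-free guidance combination, not of SafeCFG itself. The function name `cfg_combine` and the random arrays standing in for noise predictions are illustrative assumptions; the exact rule by which DSG adjusts the unconditional score is not given in this summary and is not reproduced here.

```python
import numpy as np


def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray, guidance_scale: float) -> np.ndarray:
    """Standard classifier-free guidance.

    Extrapolates from the unconditional noise prediction toward the
    conditional one by a factor of `guidance_scale`.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Placeholder noise predictions (assumption: a real pipeline would obtain
# these from the denoiser with an empty prompt and with the user prompt).
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))
eps_cond = rng.normal(size=(4, 4))

at_zero = cfg_combine(eps_uncond, eps_cond, 0.0)    # purely unconditional
at_one = cfg_combine(eps_uncond, eps_cond, 1.0)     # purely conditional
amplified = cfg_combine(eps_uncond, eps_cond, 7.5)  # prompt direction amplified
```

Because the result is anchored on the unconditional term, changing that term (as the TL;DR describes DSG doing) shifts every guided prediction without touching the model weights.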

Paper Structure

This paper contains 30 sections, 22 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Implementation of SafeCFG. The prompt input provides embeddings that are processed by the adaptive harmful feature control (AHFC) mechanism; dynamic safe guidance (DSG) then modulates the CFG generation process. During the denoising process of SafeCFG image generation, harmful images are pushed away from the harmful domain by DSG, while the impact on the generation of clean images is minimal, achieving both safe and high-quality generation. The erased features of AHFC are updated based on the generation process of DSG.
  • Figure 2: Unsupervised safe training process of DMs. (i) Given text embeddings $c$, AHFC calculates HED as an indicator of $c$'s harmfulness to guide training. Then two DMs are used, one frozen and one trainable, to dynamically update the trainable DM's parameters based on $c$'s harmfulness indicated by HED. (ii) HED illustrates AHFC properties: $AHFC(c)$ for clean data is closer to $\text{Embeddings}(\phi)$ than for harmful data, supporting SafeCFG as an unsupervised plug-in method for training safe DMs.
  • Figure 3: Images generated by different safe methods. Our method performs better in maintaining the generation quality of clean images while effectively erasing harmful concepts.
  • Figure 4: Histograms of $dis(c)$ for clean and harmful concepts. Results show that $dis(c)$ enables unsupervised training of Safe DMs. This distance measures the harmfulness of $c$, aiding in the dynamic adjustment of parameters for safety-aligned training.
  • Figure 5: Using t-SNE to visualize $AHFC(c)-\text{Embeddings}(\phi)$ of clean and harmful concepts, which occupy different positions in the text embedding space.
  • ...and 3 more figures
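The captions for Figures 2, 4, and 5 describe a distance between $AHFC(c)$ and $\text{Embeddings}(\phi)$ that is smaller for clean prompts than for harmful ones. A minimal sketch of such a distance is shown below, assuming it is a plain L2 norm over the flattened embedding difference; the paper's exact definition of $dis(c)$ may differ. The function name and the toy vectors are hypothetical placeholders, not outputs of a real text encoder or of AHFC.

```python
import numpy as np


def harmful_euclidean_distance(controlled_emb: np.ndarray, null_emb: np.ndarray) -> float:
    """L2 distance between a processed prompt embedding and the empty-prompt embedding.

    Assumption: the norm is taken over the flattened difference; a per-token
    variant would be an equally plausible reading of the captions.
    """
    diff = (controlled_emb - null_emb).ravel()
    return float(np.linalg.norm(diff))


# Toy stand-ins for Embeddings(phi) and AHFC(c) (assumption: illustrative only).
null_emb = np.zeros(8)
clean_like = null_emb + 0.1    # stays close to the empty-prompt embedding
harmful_like = null_emb + 1.0  # lies farther from the empty-prompt embedding

clean_dist = harmful_euclidean_distance(clean_like, null_emb)
harmful_dist = harmful_euclidean_distance(harmful_like, null_emb)
```

In this toy setup the clean-like vector yields the smaller distance, matching the ordering the Figure 2 caption reports. How a threshold or weighting is derived from the distance during training is not stated in this listing.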