When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance

Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu

TL;DR

We propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation in text-to-image diffusion models.

Abstract

Text-to-Image (T2I) diffusion models have made significant advances in generating high-quality images, but they also raise safety concerns about harmful content generation. Safety-guidance-based methods mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" in which mitigating one type of harm may inadvertently amplify another, thereby increasing the overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG has two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
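The contrast between averaged guidance and category-aligned guidance can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the paper's implementation: the 2-D vectors, the function names (`select_category`, `apply_guidance`, `averaged_direction`), the use of cosine similarity as the alignment score, and the plain subtraction used as the steering step.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    na, nb = norm(a), norm(b)
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def averaged_direction(category_dirs):
    """Baseline: average the per-category harmful directions into one."""
    n = len(category_dirs)
    dim = len(next(iter(category_dirs.values())))
    return [sum(d[i] for d in category_dirs.values()) / n for i in range(dim)]

def select_category(prompt_dir, category_dirs):
    """CaCI-style step (assumed form): pick the category whose harmful
    direction has the highest cosine similarity with the prompt guidance."""
    return max(category_dirs, key=lambda c: cosine(prompt_dir, category_dirs[c]))

def apply_guidance(prompt_dir, harmful_dir, scale=1.0):
    """CrGA-style step (assumed form): steer away by subtracting the
    scaled harmful direction from the prompt guidance."""
    return [p - scale * h for p, h in zip(prompt_dir, harmful_dir)]

# Hypothetical toy directions: "hate" roughly opposes "sexual".
category_dirs = {
    "sexual": [1.0, 0.0],
    "hate": [-0.8, 0.6],
}
prompt_dir = [0.9, 0.1]  # prompt guidance mostly aligned with "sexual"

chosen = select_category(prompt_dir, category_dirs)
guided_casg = apply_guidance(prompt_dir, category_dirs[chosen])
guided_avg = apply_guidance(prompt_dir, averaged_direction(category_dirs))

# Remaining component along the "sexual" direction after each kind of steering.
left_casg = dot(guided_casg, category_dirs["sexual"])
left_avg = dot(guided_avg, category_dirs["sexual"])
print(chosen, left_casg, left_avg)
```

In this toy setup the averaged direction is diluted by the opposing "hate" direction, so it removes far less of the "sexual" component than steering along the selected category alone.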

Paper Structure (34 sections, 10 equations, 11 figures, 10 tables, 2 algorithms)

Figures (11)

  • Figure 1: Safety performance of SLD under different harmful keywords, with an analysis of harmful conflicts. (a) SLD effectively steers the prompt guidance away from the harmful zone when the harmful keywords precisely match the prompt's harmful category (sex). (b) Keyword mismatch: harmful conflicts arise when steering away from the hate harmful zone inadvertently moves generation toward the sexual harmful zone. (c) Performance degrades when multi-category keywords (hate, sexual) are applied. Further analysis is presented in the section "Harmful Conflicts in Text-to-Image Safety Mechanisms".
  • Figure 2: Cross-Category Directional Conflict in latent space. Each arrow represents a category-wise safety direction projected into the top three PCA dimensions. Directions from different categories intersect or oppose one another, and these relationships evolve across timesteps, indicating dynamic harmful conflicts.
  • Figure 3: Aggregated Directional Attenuation in latent space. The horizontal axis shows diffusion timesteps, and the vertical axis lists harmful categories. Color intensity indicates category-wise directional retention (darker means higher retention). The fluctuating patterns reveal strong cross-category attenuation. More results are shown in the appendix on conflict visualization.
  • Figure 4: Overview of Conflict-aware Adaptive Safety Guidance: CASG identifies the harmful category most aligned with the current state and applies safety guidance specifically along that category to mitigate harmful conflict. In text space, alignment is estimated via the residual magnitude after orthogonal projection of the prompt embedding; in latent space, by measuring the angle between harmful and prompt guidance directions.
  • Figure 5: Comparison of T2I safety methods across different categories of harmful content. The rows show generation results for prompts related to violence and inappropriate content. Methods marked with * require parameter tuning or model modifications.
  • ...and 6 more figures
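The text-space alignment estimate mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: each category is represented by a single hypothetical direction vector, the embeddings are toy 3-D lists, and the category that leaves the smallest residual after its direction is projected out is taken to be the most aligned one (for a single direction this is equivalent to having the largest projected component). The function names are made up for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def residual_after_projection(embedding, direction):
    """Remove the component of `embedding` along `direction` and
    return what remains (the orthogonal residual)."""
    d2 = dot(direction, direction)
    if d2 == 0.0:
        return list(embedding)
    coeff = dot(embedding, direction) / d2
    return [e - coeff * d for e, d in zip(embedding, direction)]

def most_aligned_category(embedding, category_dirs):
    """Assumption: the category whose removal leaves the smallest
    residual magnitude is the one most aligned with the prompt."""
    return min(
        category_dirs,
        key=lambda c: norm(residual_after_projection(embedding, category_dirs[c])),
    )

# Hypothetical category directions and prompt embedding.
category_dirs = {
    "violence": [1.0, 0.0, 0.0],
    "hate": [0.0, 1.0, 0.0],
    "sexual": [0.0, 0.0, 1.0],
}
prompt_embedding = [0.2, 0.1, 0.9]

chosen = most_aligned_category(prompt_embedding, category_dirs)
# Apply the safeguard only along the chosen category.
cleaned = residual_after_projection(prompt_embedding, category_dirs[chosen])
print(chosen, cleaned)
```

Only the selected category's direction is projected out, so components along the other categories are left untouched instead of being mixed into a single averaged correction.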

Theorems & Definitions (1)

  • Definition 1: Harmful Conflict
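
The statement of Definition 1 is not reproduced in this listing. As intuition only, and assuming a simplified linear model that is not taken from the paper, suppose steering away from category $i$ subtracts a scaled direction $d_i$ from the guidance $g$. The symbols $g$, $d_i$, $d_j$, and $s$ are introduced here purely for illustration.

```latex
g' = g - s\, d_i, \qquad s > 0
\langle g', d_j \rangle = \langle g, d_j \rangle - s\, \langle d_i, d_j \rangle
```

Under this assumption, whenever $\langle d_i, d_j \rangle < 0$ the alignment of the guided direction with category $j$ increases, which matches the abstract's description of mitigating one type of harm while inadvertently amplifying another.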