
SafeCtrl: Region-Aware Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress

Lingyun Zhang, Yu Xie, Zhongli Fang, Yu Liu, Ping Chen

Abstract

The widespread deployment of text-to-image diffusion models is significantly challenged by the generation of visually harmful content, such as sexually explicit content, violence, and horror imagery. Common safety interventions, ranging from input filtering to model concept erasure, often suffer from two critical limitations: (1) a severe trade-off between safety and context preservation, where removing unsafe concepts degrades the fidelity of the safe content, and (2) vulnerability to adversarial attacks, where safety mechanisms are easily bypassed. To address these challenges, we propose SafeCtrl, a Region-Aware safety control framework operating on a Detect-Then-Suppress paradigm. Unlike global safety interventions, SafeCtrl first employs an attention-guided Detect module to precisely localize specific risk regions. Subsequently, a localized Suppress module, optimized via image-level Direct Preference Optimization (DPO), neutralizes harmful semantics only within the detected areas, effectively transforming unsafe objects into safe alternatives while leaving the surrounding context intact. Extensive experiments across multiple risk categories demonstrate that SafeCtrl achieves a superior trade-off between safety and fidelity compared to state-of-the-art methods. Crucially, our approach exhibits improved resilience against adversarial prompt attacks, offering a precise and robust solution for responsible generation.
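The Suppress module is said to be optimized with image-level Direct Preference Optimization. As background, the standard DPO objective (Rafailov et al., 2023) on a preferred/rejected pair can be sketched as follows; this is a generic illustration, not SafeCtrl's exact image-level adaptation, and the `beta` value is an assumed default.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on a preferred (safe) vs. rejected (unsafe) pair.

    logp_*      : log-probabilities under the model being trained
    ref_logp_*  : log-probabilities under the frozen reference model
    beta        : temperature on the implicit reward (assumed value)
    """
    ratio_w = logp_w - ref_logp_w  # log-ratio for the preferred sample
    ratio_l = logp_l - ref_logp_l  # log-ratio for the rejected sample
    # Maximize the margin between preferred and rejected log-ratios.
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```

With equal log-ratios the loss sits at log 2; it decreases as the model assigns relatively higher likelihood to the preferred (safe) sample.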

Paper Structure

This paper contains 12 sections, 8 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Solving the safety and context preservation trade-off. (a) The original Stable Diffusion generates unsafe content (e.g., sexually explicit content) from a harmful prompt. (b) Prior global fine-tuning methods (e.g., ESD) remove the concept but suffer from severe context degradation, altering the background, lighting, and identity. (c) SafeCtrl (Ours) employs a Region-Aware "Detect-Then-Suppress" paradigm. By keeping the base model frozen (blue blocks) and using an external module (green blocks) trained via DPO, SafeCtrl precisely localizes and neutralizes risk, ensuring safety while strictly preserving the original context and artistic intent.
  • Figure 2: The overall architecture of the SafeCtrl framework. (Left) SafeCtrl is instantiated as a set of external modules that operate in parallel with the frozen U-Net. Safety control is adaptively activated within specific diffusion stages: the Detection Timestep Window $[T_{\text{start}}, T_{\text{switch}}]$ and the Suppression Timestep Window $[T_{\text{switch}}, 0]$. (Right) A detailed view inside a single Safety Control Module, illustrating the Detect-Then-Suppress paradigm. The Risk Detection component first analyzes image features to compute a risk mask. The suppression process is governed by an activation trigger ($A$): only if a risk is detected does the Preference-Aligned Suppression component activate, supplying a safety-guided attention output ($V_{safe}$) to the original cross-attention layer.
  • Figure 3: Detailed comparison of NudeNet detection counts on the I2P benchmark. Each subplot corresponds to a different safety method, showing the number of generated images flagged for various nudity-related categories (lower is better).
  • Figure 4: Impact of Timesteps on Detection Accuracy. mIoU scores across concepts consistently improve as the denoising step $t$ decreases (1000 $\to$ 0). This empirically justifies our Dynamic Scheduling. We activate detection around $t \in [600, 800]$ where semantic structures stabilize, avoiding computation on early noisy steps.
  • Figure 5: Qualitative Comparison on Nudity Suppression. Columns show results using the same seed. SLD often fails to ensure safety, whereas ESD, SPM, AlignGuard, and RDM achieve safety at the cost of altering the original background and identity, leading to context degradation. Although CR operates locally, its hard replacement can produce rigid artifacts. SafeCtrl combines precise localization (red masks) with natural DPO-based suppression, consistently generating safe outputs while preserving the original artistic intent and background details.
  • ...and 1 more figure
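The per-timestep control flow described in Figure 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `detect` and `suppress` callables stand in for the Risk Detection and Preference-Aligned Suppression components, and the window boundaries and risk threshold are assumed values.

```python
import torch

class SafetyControlModule:
    """Sketch of SafeCtrl's Detect-Then-Suppress control flow (illustrative)."""

    def __init__(self, detect, suppress, t_start=800, t_switch=600, risk_thresh=0.5):
        self.detect, self.suppress = detect, suppress
        self.t_start, self.t_switch = t_start, t_switch  # timestep window bounds (assumed)
        self.risk_thresh = risk_thresh                   # activation threshold (assumed)
        self.risk_mask = None                            # cached between the two windows

    def __call__(self, t, feats, cross_attn_out):
        # Detection window [T_start, T_switch]: localize risk regions via the Detect module.
        if self.t_switch <= t <= self.t_start:
            self.risk_mask = self.detect(feats)          # [B, 1, H, W] in [0, 1]
        # Suppression window [T_switch, 0]: act only if the activation trigger A fired.
        elif t < self.t_switch and self.risk_mask is not None:
            if self.risk_mask.max() > self.risk_thresh:  # trigger A: risk detected
                v_safe = self.suppress(feats)            # safety-guided attention output
                # Replace the attention output only inside the detected region,
                # leaving the surrounding context untouched.
                return self.risk_mask * v_safe + (1 - self.risk_mask) * cross_attn_out
        return cross_attn_out
```

The key design point this sketch captures is locality: the frozen U-Net's cross-attention output passes through unchanged everywhere except inside the detected risk mask, and only after the trigger fires.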