Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models

Xiangtao Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, Shanqing Guo

TL;DR

This work addresses the safety challenges of text-to-image generation by introducing Safe-Control, a plug‑and‑play safety patch that keeps the production model locked and learns conditional safety signals without degrading benign content. The patch is trained on a multimodal dataset that pairs unsafe prompts with safe images and uses textual safety modification instructions to guide generation. Multiple patches can be merged into a single unified patch, which transfers to other models with similar architectures. Empirical results across six SD‑family models show substantial reductions in unsafe content (down to about 7% under natural prompts) while preserving text alignment and image quality, outperforming seven state‑of‑the‑art baselines and resisting adversarial attacks such as SneakyPrompt and Ring‑A‑Bell. The approach offers a practical, transferable defense for evolving safety requirements that can be plugged in or out without retraining the base model, though its effectiveness depends on data quality and measurement tools.

Abstract

Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.
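The patch idea in the abstract (a locked base model, externally injected control signals, and several patches merged into one) can be illustrated with a toy sketch. This is not the authors' architecture: `locked_base`, `make_patch`, and `merge_patches` are hypothetical names, the "model" is a fixed scaling of a small vector, and merging by a weighted sum of residuals is an assumption made purely for illustration.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
# A patch maps (latent, safety condition text) to a residual control signal.
Patch = Callable[[Vector, str], Vector]


def locked_base(latent: Vector, residual: Optional[Vector] = None) -> Vector:
    """Toy stand-in for a frozen denoiser; its fixed 0.5 scale is never updated."""
    out = [0.5 * x for x in latent]
    if residual is not None:
        # The control signal is added from outside; the base itself is unchanged.
        out = [o + r for o, r in zip(out, residual)]
    return out


def make_patch(strength: float) -> Patch:
    """Hypothetical patch: emits a residual only when a safety condition is given."""
    def patch(latent: Vector, condition: str) -> Vector:
        if not condition:
            return [0.0 for _ in latent]
        return [-strength * x for x in latent]
    return patch


def merge_patches(patches: Dict[str, Patch], weights: Dict[str, float]) -> Patch:
    """Assumed merge rule: weighted sum of the individual patches' residuals."""
    def merged(latent: Vector, condition: str) -> Vector:
        total = [0.0] * len(latent)
        for name, patch in patches.items():
            w = weights.get(name, 0.0)
            residual = patch(latent, condition)
            total = [t + w * r for t, r in zip(total, residual)]
        return total
    return merged


latent = [1.0, -2.0]
patches = {"nudity": make_patch(0.25), "violence": make_patch(0.125)}
unified = merge_patches(patches, {"nudity": 1.0, "violence": 0.5})

plain = locked_base(latent)  # patch unplugged
patched = locked_base(latent, unified(latent, "remove unsafe content"))
print(plain, patched)
```

Because the patch only contributes an additive signal, unplugging it (passing no residual) recovers the base model's output exactly, which is the plug-in/plug-out property the abstract describes. The per-patch weights play the role of a safety-level knob in this sketch.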

Paper Structure

This paper contains 25 sections, 4 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: An illustration of our proposed Safe-Control. It locks down production-ready T2I generation models (the "system") and applies conditional safety controls (the "patch") to the model during its image generation process.
  • Figure 2: The overall architecture of Safe-Control, including the preliminary generation of training data and the training stage.
  • Figure 3: The prompt template for generating safe prompts corresponding to unsafe prompts in ChatGPT 3.5.
  • Figure 4: The results of $\textit{Safe-Control}_{nudity}$ and baselines in reducing the generation of various exposed body parts, where ‘M’ stands for male and ‘F’ stands for female.
  • Figure 5: Comparison between our proposed Safe-Control and SLD on the I2P prompt dataset. The different points of each method represent different safety level configurations. The "h1" to "h5" are the different weight combination strategies of multiple Safe-Control patches, with safety levels ranging from low to high.
  • ...and 2 more figures