SafeText: Safe Text-to-image Models via Aligning the Text Encoder

Yuepeng Hu, Zhengyuan Jiang, Neil Zhenqiang Gong

TL;DR

This work addresses the safety risk of text-to-image models generating harmful content by proposing SafeText, which fine-tunes only the text encoder to shift the embeddings of unsafe prompts while preserving the embeddings of safe prompts. The method introduces two loss terms, an effectiveness loss $L_e$ computed on unsafe prompts and a utility loss $L_u$ computed on safe prompts, and solves $\min_{\tau_s} L_u - \lambda L_e$ over the fine-tuned text encoder $\tau_s$ with standard gradient-based optimization, using separate datasets of safe and unsafe prompts. The hyperparameter $\lambda$ controls the trade-off: minimizing $L_u$ keeps safe-prompt embeddings close to the original ones, while the negated $L_e$ term pushes unsafe-prompt embeddings away from them. Empirical results across multiple models and datasets show that SafeText achieves high effectiveness (NRR > 98%) against manually crafted and jailbreak prompts while maintaining strong utility (measured by LPIPS and FID) on safe prompts, outperforming six baseline methods. Because the diffusion module is left unchanged, the approach is a practical safety mechanism that generalizes across models, and the authors plan to release code and data to support public use and benchmarking.
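
To make the objective $L_u - \lambda L_e$ concrete, here is a minimal sketch in plain Python. It assumes, as an illustration rather than the authors' exact definitions, that each loss is the mean distance between the original and fine-tuned encoder's embeddings: $L_u$ over safe prompts and $L_e$ over unsafe prompts. The function names, the toy two-dimensional embeddings, and the choice of Euclidean distance as the default metric are all hypothetical; the figure captions indicate that negative cosine similarity is one of the metrics the paper considers for $d_e$.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_cosine(a, b):
    """Negative cosine similarity, an alternative distance-like metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return -dot / (norm_a * norm_b)

def mean_distance(original, tuned, dist):
    """Average distance between paired original and fine-tuned embeddings."""
    return sum(dist(o, t) for o, t in zip(original, tuned)) / len(original)

def safetext_objective(orig_safe, tuned_safe, orig_unsafe, tuned_unsafe,
                       lam, d_u=euclidean, d_e=euclidean):
    """Assumed form of the objective: L_u - lam * L_e (to be minimized).

    L_u: how far safe-prompt embeddings moved (should stay small).
    L_e: how far unsafe-prompt embeddings moved (should become large).
    """
    l_u = mean_distance(orig_safe, tuned_safe, d_u)
    l_e = mean_distance(orig_unsafe, tuned_unsafe, d_e)
    return l_u - lam * l_e

# Toy embeddings (hypothetical): the safe prompt is unchanged,
# the unsafe prompt is moved a distance of 5 from its original embedding.
orig_safe, tuned_safe = [[1.0, 0.0]], [[1.0, 0.0]]
orig_unsafe, tuned_unsafe = [[0.0, 1.0]], [[3.0, 5.0]]

value = safetext_objective(orig_safe, tuned_safe,
                           orig_unsafe, tuned_unsafe, lam=0.5)
print(value)  # -2.5
```

In this toy case the objective is negative because the safe embedding did not move ($L_u = 0$) while the unsafe embedding did ($L_e = 5$). An encoder that left every embedding unchanged would score exactly 0, so lower values correspond to the behavior the method aims for. In an actual fine-tuning run the embeddings would come from the text encoder and the objective would be minimized with a gradient-based optimizer, as the summary states.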

Abstract

Text-to-image models can generate harmful images when presented with unsafe prompts, posing significant safety and societal risks. Alignment methods aim to modify these models to ensure they generate only non-harmful images, even when exposed to unsafe prompts. A typical text-to-image model comprises two main components: 1) a text encoder and 2) a diffusion module. Existing alignment methods mainly focus on modifying the diffusion module to prevent harmful image generation. However, this often significantly impacts the model's behavior for safe prompts, causing substantial quality degradation of generated images. In this work, we propose SafeText, a novel alignment method that fine-tunes the text encoder rather than the diffusion module. By adjusting the text encoder, SafeText significantly alters the embedding vectors for unsafe prompts, while minimally affecting those for safe prompts. As a result, the diffusion module generates non-harmful images for unsafe prompts while preserving the quality of images for safe prompts. We evaluate SafeText on multiple datasets of safe and unsafe prompts, including those generated through jailbreak attacks. Our results show that SafeText effectively prevents harmful image generation with minor impact on the images for safe prompts, and SafeText outperforms six existing alignment methods. We will publish our code and data after paper acceptance.

Paper Structure

This paper contains 16 sections, 4 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 3: (a) NRR on NSFW and (b) LPIPS on MS-COCO for SafeText with different distance metrics and $\lambda$ values. Controlled experiments to assess the impact of embedding direction and magnitude on (c) harmfulness of images for unsafe prompts and (d) utility of images for safe prompts.
  • Figure 4: NRR on NSFW and LPIPS on MS-COCO of our SafeText with different (a) number of epochs, (b) learning rates, and (c) batch sizes.
  • Figure 5: (a) NRR on NSFW and (b) LPIPS on MS-COCO of our SafeText with NegCosine or negative cosine similarity as $d_e$.
  • Figure 6: Images generated by SDXL without alignment (first row) and with our SafeText (second row) for eight unsafe prompts.
  • Figure 7: Images generated by DP without alignment (first row) and with our SafeText (second row) for eight unsafe prompts.
  • ...and 10 more figures