PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Lingzhi Yuan; Xinfeng Li; Chejian Xu; Guanhong Tao; Xiaojun Jia; Yihao Huang; Wei Dong; Yang Liu; Xiaofeng Wang; Bo Li

TL;DR

PromptGuard tackles NSFW generation in text-to-image models by learning a universal soft prompt $P^*$ embedded in the text encoder, delivering lightweight, parameter-free safety guidance. It adopts a divide-and-conquer strategy with category-specific embeddings and a contrastive/adversarial training objective to suppress unsafe outputs while preserving benign image quality. Across five benchmarks and eight baselines, PromptGuard achieves a low unsafe ratio (as low as $5.84\%$) and demonstrates robustness under adversarial attacks, with a reported 3.8x speedup over prior methods and scalable support for new NSFW categories. The approach is transferable across SD-based models sharing the same text encoder, enabling practical deployment with minimal additional computational cost and broad applicability to future T2I architectures.

Abstract

Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy, which optimizes category-specific soft prompts and combines them into holistic safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods and surpasses eight state-of-the-art defenses, with an optimal unsafe ratio as low as 5.84%.

Paper Structure (36 sections, 3 equations, 8 figures, 14 tables)

Introduction
Related Work
Content Moderation
Model Alignment
Background
Text-to-Image (T2I) Generation
System Prompt
PromptGuard
Overview
Training Data Preparation
Individual Soft Prompt Embedding Training
Inference
Experiments
Experiment Setup
NSFW Content Moderation
...and 21 more sections

Figures (8)

Figure 1: Unlike existing moderation frameworks that rely on additional models to check or detoxify NSFW content, PromptGuard presents an efficient, universal soft prompt, $P_*$, inspired by the system prompt mechanism in LLMs, to directly moderate NSFW inputs and generate safe yet realistic content.
Figure 2: Diagram of PromptGuard. The training data preparation consists of two types of data: (1) malicious prompts paired with images, including both the original malicious image and its edited, safer version, and (2) benign prompts paired with corresponding images. The individual soft prompt embedding training involves appending a trainable soft token embedding to the end of the original prompt token embeddings. Focusing on one unsafe category at a time, we train only the parameters of the soft token embedding using the loss function $L_m$ or $L_b$, depending on whether the input is malicious or benign, respectively. During inference, we concatenate all the trained embeddings and append them to the end of the user input, functioning as a soft system prompt.
Figure 3: SDEdit helps build fine-grained image pairs for malicious data by modifying only the unsafe visual region.
Figure 4: PromptGuard successfully moderates the unsafe content across four categories. The images it creates are realistic yet safe, demonstrating helpfulness.
Figure 5: Adversarial robustness against three red-teaming settings: SneakyPrompt-N (natural words), SneakyPrompt-P (pseudo words), and MMA-Diffusion (pseudo words).
...and 3 more figures
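
The inference step described in the Figure 2 caption (concatenate the trained per-category embeddings, then append them to the end of the user input) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: embeddings are plain Python lists rather than tensors, and the function names `build_safety_prompt` and `append_soft_prompt`, the category names, and the fixed alphabetical concatenation order are all assumptions made for the example.

```python
from typing import Dict, List

Embedding = List[float]      # one token embedding vector
Sequence = List[Embedding]   # a sequence of token embeddings


def build_safety_prompt(category_prompts: Dict[str, Sequence]) -> Sequence:
    """Concatenate per-category soft prompts into one soft prompt.

    The alphabetical ordering is an assumption for reproducibility;
    the source does not state which order is used.
    """
    combined: Sequence = []
    for name in sorted(category_prompts):
        combined.extend(category_prompts[name])
    return combined


def append_soft_prompt(prompt_embeds: Sequence, safety_prompt: Sequence) -> Sequence:
    """Append the soft prompt after the user's prompt token embeddings."""
    if prompt_embeds and safety_prompt:
        dim = len(prompt_embeds[0])
        for vec in safety_prompt:
            if len(vec) != dim:
                raise ValueError("soft prompt and user prompt dimensions differ")
    # The user's embeddings are left untouched; only extra tokens are added.
    return list(prompt_embeds) + list(safety_prompt)


# Toy 2-dimensional embeddings (hypothetical values).
user_embeds = [[0.1, 0.2], [0.3, 0.4]]
category_prompts = {
    "violent": [[1.0, 1.0]],
    "sexual": [[2.0, 2.0]],
}
safe_input = append_soft_prompt(user_embeds, build_safety_prompt(category_prompts))
print(len(safe_input))
```

In this toy setup the combined sequence is what would be passed on in place of the user's embeddings alone; no extra model is called at inference time, which matches the efficiency claim in the abstract.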
