GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, Qiang Xu

TL;DR

GuardT2I addresses safety concerns in text-to-image generation by introducing a generative defense that uses a conditional LLM to translate latent guidance into textual prompt interpretations, enabling detection of adversarial prompts without compromising generation quality or adding latency. The method employs a bi-level parse with a Verbalizer and a Sentence Similarity Checker to halt unsafe prompts, and is trained on a mapped guidance embedding dataset derived from unfiltered LAION-COCO prompts. Extensive experiments show GuardT2I outperforms commercial baselines like OpenAI-Moderation and Microsoft Azure Moderator across diverse NSFW prompts and remains robust under adaptive attacks, while providing interpretable decisions. The approach is open-source, scalable, and integrates in parallel with T2I generation, offering practical threat mitigation for real-world T2I services.

Abstract

Recent advancements in Text-to-Image (T2I) models have raised significant safety concerns about their potential misuse for generating inappropriate or Not-Safe-For-Work (NSFW) contents, despite existing countermeasures such as NSFW classifiers or model fine-tuning for inappropriate concept removal. Addressing this challenge, our study unveils GuardT2I, a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts. Instead of making a binary classification, GuardT2I utilizes a Large Language Model (LLM) to conditionally transform text guidance embeddings within the T2I models into natural language for effective adversarial prompt detection, without compromising the models' inherent performance. Our extensive experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator by a significant margin across diverse adversarial scenarios. Our framework is available at https://github.com/cure-lab/GuardT2I.
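
The two-stage check described above (a Verbalizer followed by a Sentence Similarity Checker) can be illustrated with a minimal sketch. It is not the authors' implementation: the c·LLM is not modeled, so the prompt interpretation is passed in as a plain string; the word list `SENSITIVE_WORDS`, the function names, and the `threshold` value are hypothetical; and a bag-of-words cosine similarity stands in for whatever sentence-similarity model the framework actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder vocabulary; the source does not list the real one.
SENSITIVE_WORDS = {"nude", "gore"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def verbalizer_hits(interpretation, sensitive_words=SENSITIVE_WORDS):
    """Return the sensitive words that appear in the interpretation."""
    return sorted(set(tokenize(interpretation)) & set(sensitive_words))


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity (a stand-in for a sentence encoder)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def moderate(prompt, interpretation, threshold=0.5):
    """Flag a prompt if its interpretation contains sensitive words,
    or if the interpretation diverges too far from the prompt itself."""
    hits = verbalizer_hits(interpretation)
    if hits:
        return {"safe": False, "reason": "sensitive words", "detail": hits}
    similarity = cosine_similarity(prompt, interpretation)
    if similarity < threshold:
        return {"safe": False, "reason": "low similarity", "detail": similarity}
    return {"safe": True, "reason": "ok", "detail": similarity}
```

Calling `moderate(prompt, interpretation)` with an interpretation that matches the prompt returns a safe decision, while an obfuscated prompt whose interpretation contains a listed word, or shares no vocabulary with the prompt, is flagged together with the reason. Returning the reason alongside the decision mirrors the interpretability the paper emphasizes.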

Paper Structure

This paper contains 29 sections, 5 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of GuardT2I. GuardT2I can effectively halt the generation process of adversarial prompts to avoid NSFW generations, without compromising normal prompts or increasing inference time.
  • Figure 2: The workflow of GuardT2I against adversarial prompts. (a) GuardT2I halts the generation process of adversarial prompts. (b) Within GuardT2I, the c$\cdot$LLM translates the latent guidance embedding e into natural language, accurately reflecting the user's intent. (c) A two-fold generation parse detects adversarial prompts: the Verbalizer identifies NSFW content through sensitive-word analysis, and the Sentence Similarity Checker flags prompts whose interpretations are significantly dissimilar to the inputs. (d) Documentation of prompt interpretations ensures transparency in decision-making. The ★ symbol is used to avoid offense.
  • Figure 3: Architecture of c$\cdot$LLM. T2I's text guidance embedding e is fed to c$\cdot$LLM through the multi-head cross attention layer's query entry. L indicates the total number of transformer blocks.
  • Figure 4: Workflow of the Sentence Similarity Checker. (a) Normal prompt: the prompt interpretation closely aligns with the original prompt, resulting in an SFW decision. (b) Adversarial prompt: the prompt interpretation differs significantly from the original prompt, so the prompt is identified as adversarial.
  • Figure 5: ROC curves of our GuardT2I and baselines against various adversarial prompts. The black line represents the GuardT2I model's consistent and high AUROC scores across different thresholds.
  • ...and 6 more figures