Finding a Wolf in Sheep's Clothing: Combating Adversarial Text-To-Image Prompts with Text Summarization

Portia Cooper, Harshita Narnoli, Mihai Surdeanu

TL;DR

This work tackles safety vulnerabilities in text-to-image generation arising from divide-and-conquer adversarial prompts (DACA). It proposes a two-layer defense that first summarizes prompts to strip obfuscation and then classifies content as appropriate or inappropriate. On the ATTIP dataset, encoder-based summarization yields the strongest detection performance, achieving up to about 98% F1 on the inappropriate class, with explanations via LIME indicating improved interpretability. The findings support integrating summarization into moderation pipelines to harden defenses against obfuscated prompts and enhance robustness of content-filtering in image-generation systems.

Abstract

Text-to-image models are vulnerable to the stepwise "Divide-and-Conquer Attack" (DACA), which uses a large language model to obfuscate inappropriate content in prompts by wrapping sensitive text in a benign narrative. To mitigate stepwise DACA attacks, we propose a two-layer method involving text summarization followed by binary classification. We assembled the Adversarial Text-to-Image Prompt (ATTIP) dataset ($N=940$), which contained DACA-obfuscated and non-obfuscated prompts. From the ATTIP dataset, we created two summarized versions: one generated by a small encoder model and the other by a large language model. Then, we used an encoder classifier and a GPT-4o classifier to perform content moderation on the summarized and unsummarized prompts. When compared with a classifier that operated over the unsummarized data, our method improved F1 score performance by 31%. Further, the highest recorded F1 score (98%) was produced by the encoder classifier on a summarized ATTIP variant. This study indicates that pre-classification text summarization can inoculate content detection models against stepwise DACA obfuscations.
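The two-layer idea described above (summarize first, then classify) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the word lists, the density threshold, and the example prompts are hypothetical, and the toy summarizer and classifier merely stand in for the encoder and GPT-4o models used in the paper. The sketch only shows why stripping narrative "fluff" before classification can expose content that a filter would otherwise miss, and how F1 on the inappropriate class can be computed.

```python
from typing import Callable, Sequence

Summarizer = Callable[[str], str]
Classifier = Callable[[str], bool]  # True means "inappropriate"

# Hypothetical word lists, chosen only to make the toy example work.
STOP_WORDS = {"in", "the", "of", "there", "is", "a", "and", "he", "with"}
FLAGGED_TERMS = {"weapon", "blood"}


def toy_summarize(prompt: str) -> str:
    """Stand-in for layer 1: compress the prompt by dropping filler words."""
    return " ".join(w for w in prompt.lower().split() if w not in STOP_WORDS)


def toy_classify(text: str, threshold: float = 0.25) -> bool:
    """Stand-in for layer 2: flag text whose share of flagged terms is high.

    A density rule is used so that long benign padding dilutes the signal,
    loosely mimicking how obfuscation can slip past a filter.
    """
    tokens = text.lower().split()
    if not tokens:
        return False
    share = sum(t in FLAGGED_TERMS for t in tokens) / len(tokens)
    return share >= threshold


def moderate(prompt: str, summarize: Summarizer, classify: Classifier) -> bool:
    """Two-layer moderation: summarize the prompt, then classify the summary."""
    return classify(summarize(prompt))


def f1_inappropriate(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 score with "inappropriate" (True) as the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative prompts (not taken from ATTIP): one padded, one benign.
PROMPTS = [
    "in the quiet of the morning there is a man and he is with a weapon "
    "and there is blood",
    "a quiet morning in the field with a man and a dog",
]
LABELS = [True, False]

without_summary = [toy_classify(p) for p in PROMPTS]
with_summary = [moderate(p, toy_summarize, toy_classify) for p in PROMPTS]

if __name__ == "__main__":
    print("F1 without summarization:", f1_inappropriate(LABELS, without_summary))
    print("F1 with summarization:   ", f1_inappropriate(LABELS, with_summary))
```

Because `moderate` takes the summarizer and classifier as plain callables, real models could be swapped in for the toy functions without changing the pipeline; the numbers this toy produces say nothing about the paper's reported results.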

Paper Structure

This paper contains 18 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Divide-and-conquer attack (DACA) that "hides a wolf in sheep's clothing": sub-figure (a) shows an appropriate prompt cleared by the content detection filter; (b) shows an inappropriate prompt flagged by the filter; (c) shows an inappropriate prompt altered by DACA obfuscation that bypasses the filter; and (d) shows a summarized version of the obfuscated prompt, with the "fluff" removed, that fails to pass the filter.
  • Figure 2: Distribution of poor-, fair-, and high-quality labels across the LIME plots generated by the encoder classifier on data from the ATTIP baseline dataset, the encoder summaries, and the GPT-4o summaries.
  • Figure 3: High-quality explanation plot generated on the encoder-summarized variant of an inappropriate prompt (the high-quality label was assigned because eight of the top ten weighted terms were correctly classified).
  • Figure 4: High-quality explanation plot generated on the GPT-4o-summarized variant of an appropriate prompt (the high-quality label was assigned because all of the weighted terms were correctly classified).
  • Figure 5: Poor-quality explanation plot generated on the encoder-summarized variant of an appropriate prompt (the poor-quality label was assigned because only three of the top ten weighted terms were correctly classified).
  • ...and 1 more figure