An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)

Ted Kwartler, Nataliia Bagan, Ivan Banny, Alan Aqrawi, Arian Abbasi

TL;DR

This paper extends the Single-Turn Crescendo Attack (STCA) from text-only models to text-to-image generation by embedding a crescendo-like adversarial narrative into a single prompt (STCA-3) and testing it against DALL-E 3, with Flux Schnell as an uncensored baseline. Using large-scale raw-prompt generation and a hidden meta-prompt that structures three turns, the study demonstrates that STCA prompts substantially increase unsafe image outputs, bringing censored models closer to uncensored baselines. A multimodal evaluation pipeline combines a GPT-4o safety classifier with human review to quantify guardrail bypass, providing a scalable framework for red-teaming guardrails in multimodal generative AI. The findings underscore significant safety risks and motivate the development of stronger guardrails, detection methods, and ongoing monitoring across image-, audio-, and video-generation systems.

Abstract

The Single-Turn Crescendo Attack (STCA), first introduced in Aqrawi and Abbasi [2024], is an innovative method designed to bypass the ethical safeguards of text-to-text AI models, compelling them to generate harmful content. The technique strategically escalates context within a single prompt and combines that escalation with trust-building mechanisms to subtly deceive the model into producing unintended outputs. Extending STCA to text-to-image models, we demonstrate its efficacy by compromising the guardrails of a widely used model, DALL-E 3, and obtaining outputs comparable to those of the uncensored model Flux Schnell, which served as a baseline control. This study provides a framework for researchers to rigorously evaluate the robustness of guardrails in text-to-image models and to benchmark their resilience against adversarial attacks.

Paper Structure

This paper contains 19 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Harmful image generation rates in DALL-E 3 for normal prompts and under STCA attack. The stacked bars show the distribution of responses between unsafe (red), safe (yellow), and rejected (green) outputs for both normal prompting and STCA-3 attack scenarios. The dotted lines represent baseline measurements from an uncensored model (Flux Schnell) against the same set of prompts, with the blue line showing the normal baseline rate of unsafe outputs and the cyan line indicating the elevated unsafe output rate when using STCA-3 prompts. The results demonstrate how STCA-3 prompts not only bypass DALL-E 3's safety mechanisms but also achieve harmful generation rates comparable to an uncensored model with normal prompts, representing a significant circumvention of the model's safety filters.
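
The stacked bars in Figure 1 report, for each prompting condition, the share of outputs judged unsafe, safe, or rejected. The following is a minimal sketch of how such shares could be tallied from per-output labels. The function name `outcome_rates`, the label strings, and the example list are assumptions made for illustration; they are not the authors' pipeline, and the example labels are placeholders, not data from the paper.

```python
from collections import Counter

# The three outcome categories named in the Figure 1 caption.
# The exact label strings are an assumption for this sketch.
CATEGORIES = ("unsafe", "safe", "rejected")


def outcome_rates(labels):
    """Return the fraction of outputs that fall into each category."""
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not labels:
        # No outputs to tally: report zero for every category.
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter(labels)
    return {category: counts[category] / len(labels) for category in CATEGORIES}


# Placeholder labels for illustration only (not results from the paper).
example = ["rejected", "rejected", "safe", "unsafe"]
rates = outcome_rates(example)
```

Computing the same rates separately for normal prompts and for STCA-3 prompts, on both the guarded model and the uncensored baseline, would yield the four quantities that the figure compares.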