An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)
Ted Kwartler, Nataliia Bagan, Ivan Banny, Alan Aqrawi, Arian Abbasi
TL;DR
This paper extends the Single-Turn Crescendo Attack (STCA) from text-only models to text-to-image generation by embedding a crescendo-like adversarial narrative into a single prompt (STCA-3) and testing it against DALL-E 3, with Flux Schnell as an uncensored baseline. Using large-scale raw-prompt generation and a hidden meta-prompt that structures three turns, the study demonstrates that STCA prompts substantially increase unsafe image outputs, bringing censored models closer to uncensored baselines. A multimodal evaluation pipeline combines a GPT-4o safety classifier with human review to quantify guardrail bypass, providing a scalable framework for red-teaming guardrails in multimodal generative AI. The findings underscore significant safety risks and motivate the development of stronger guardrails, detection methods, and ongoing monitoring across image-, audio-, and video-generation systems.
Abstract
The Single-Turn Crescendo Attack (STCA), first introduced in Aqrawi and Abbasi [2024], is a method designed to bypass the ethical safeguards of text-to-text AI models and compel them to generate harmful content. The technique strategically escalates context within a single prompt and combines it with trust-building mechanisms to subtly deceive the model into producing unintended outputs. Extending STCA to text-to-image models, we demonstrate its efficacy by compromising the guardrails of a widely used model, DALL-E 3, and obtaining outputs comparable to those of the uncensored model Flux Schnell, which served as a baseline control. This study provides a framework for researchers to rigorously evaluate the robustness of guardrails in text-to-image models and benchmark their resilience against adversarial attacks.
