
Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda

Abstract

Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work, we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques, including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories, we observe an attack success rate (ASR) of up to 74.47%.
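For reference, the reported ASR can be read under the conventional definition of the metric; the abstract does not spell the formula out, so the following is an assumption rather than the authors' stated definition:

\[
\mathrm{ASR} \;=\; \frac{N_{\text{success}}}{N_{\text{total}}} \times 100\%,
\]

where $N_{\text{success}}$ counts attack prompts that bypass all filters and yield restricted imagery, and $N_{\text{total}}$ counts all attack attempts.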

Paper Structure

This paper contains 31 sections, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Illustration of low-effort prompt-based jailbreak attacks on text-to-image systems. Direct requests for unsafe content are typically blocked by safety filters, but simple prompt modifications can bypass moderation and produce restricted imagery. These attacks require no model access or technical expertise, highlighting the accessibility of jailbreak strategies in modern generative media systems.
  • Figure 2: Illustration of a typical moderation pipeline in text-to-image systems. Safety mechanisms are applied at multiple stages, including prompt filtering, semantic validation, image generation, and post-generation visual moderation (a minimal code sketch of these stages follows the figure list).
  • Figure 3: Examples of images generated using Artistic Reframing Attacks (ARA), where unsafe content is embedded within artistic or historical contexts.
  • Figure 4: Examples of images generated using Lifestyle Subculture Aesthetic Attacks (LSAA), where unsafe elements are masked within stylistically rich prompts.
  • Figure 5: Examples of images generated using Pseudo-Educational Framing Attacks (PEFA), where unsafe content is presented in instructional or scientific formats.
  • ...and 2 more figures
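To make the staged design described in the Figure 2 caption concrete, the sketch below (Python) mirrors the four stages it names. Everything here is an illustrative assumption: the function names, the toy keyword blocklist, and the stub logic are hypothetical placeholders, not the authors' implementation or any deployed system's API.

# Illustrative sketch of the staged moderation pipeline from the
# Figure 2 caption. Every component is a hypothetical placeholder;
# a real system would back each stage with trained classifiers.

BLOCKLIST = {"gore", "weapon"}  # toy keyword list, for the sketch only

def prompt_filter(prompt: str) -> bool:
    """Stage 1: surface-level prompt filtering via keyword matching."""
    return not any(term in prompt.lower() for term in BLOCKLIST)

def semantic_validator(prompt: str) -> bool:
    """Stage 2: semantic validation; stands in for an intent classifier."""
    return True  # placeholder: a real system would score unsafe intent

def generate_image(prompt: str) -> bytes:
    """Stage 3: image generation; stands in for a generative model call."""
    return b"<image bytes>"

def visual_moderator(image: bytes) -> bool:
    """Stage 4: post-generation visual moderation, e.g. an unsafe-image classifier."""
    return True  # placeholder: a real system would classify the image

def moderate_and_generate(prompt: str):
    """Run all four stages in order; any failing stage blocks the request."""
    if not prompt_filter(prompt):
        return None  # blocked at prompt filtering
    if not semantic_validator(prompt):
        return None  # blocked at semantic validation
    image = generate_image(prompt)
    if not visual_moderator(image):
        return None  # blocked at post-generation moderation
    return image

The attacks studied in this paper target the gap between Stage 1's surface-level matching and the deeper intent analysis Stage 2 would need: a prompt that avoids blocklisted terms while preserving unsafe semantics passes the cheap check and, if the semantic validator is weak, reaches generation.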