When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang

TL;DR

Vision-centric jailbreaks reveal a safety gap where malicious instructions can be embedded in visual prompts, bypassing text-focused safeguards. The authors introduce VJA, a visual-to-visual jailbreak, and IESBench, a comprehensive safety benchmark for vision-based image editing, coupled with an introspection-based defense that requires no extra guard models. Empirical results show strong attack effectiveness across commercial and open-source models, while the proposed defense significantly mitigates risk with minimal overhead. This work provides benchmarks, analysis, and practical defenses to advance safe and trustworthy image editing systems in multimodal settings.

Abstract

Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities and provide both a benchmark and a practical defense to advance safe and trustworthy image editing systems. Warning: This paper contains offensive images created by large image editing models.

Paper Structure

This paper contains 34 sections, 5 equations, 19 figures, and 6 tables.

Figures (19)

  • Figure 1: Comparison between our Vision-Centric Jailbreak Attack (VJA) and conventional Text-Centric Jailbreak Attacks. Top: Attack scheme comparison; Bottom: Performance comparison on a subset of $\mathtt{IESBench}$, where VJA achieves significantly higher attack success rates across four commercial models.
  • Figure 2: Overview and statistics of our constructed $\mathtt{IESBench}$. Note that the proposed VJA is a vision-only jailbreak attack, so no additional text prompts are needed.
  • Figure 3: Introspection-based Defense, which leverages a safety trigger to enhance the security of large image editing models.
  • Figure 4: Illustration of $\mathtt{IESBench}$ construction. The top figure shows the 15 risk categories covered in our $\mathtt{IESBench}$ in a hierarchical manner, and the bottom figure shows the pipeline for dataset curation and evaluation.
  • Figure 5: Average harmfulness score comparison between different models on $\mathtt{IESBench}$. (a) shows the distribution of samples in different levels of our $\mathtt{IESBench}$. (b)-(i) illustrate the average HS of models for attacks in different risk levels.
  • ...and 14 more figures