JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

Haolun Zheng; Yu He; Tailun Chen; Shuo Shao; Zhixuan Chu; Hongbin Zhou; Lan Tao; Zhan Qin; Kui Ren

Abstract

Text-to-image (T2I) models such as Stable Diffusion and DALL·E remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreaking as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, JANUS outperforms state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.

Paper Structure (31 sections, 43 equations, 8 figures, 4 tables)

Introduction
Background & Related Work
Text-to-Image Generation
Jailbreak Attacks on T2I Models
Methodology
Threat Model
Overview of JANUS
Stage 1: Semantically-Anchored Distribution Modeling
Stage 2: Policy-based Black-box Optimization
Experiments
Experimental Setup
Main Results
Ablation Study
Conclusion
Experiments & Details
...and 16 more sections

Figures (8)

Figure 1: Qualitative results of JANUS on Stable Diffusion 3.5 Large Turbo (left) and DALL$\cdot$E3 (right). JANUS rewrites unsafe target prompts into distributionally optimized, ostensibly benign queries that bypass both text- and image-level safety filters, yet still induce model outputs aligned with the original prohibited intent.
Figure 2: Overall pipeline of JANUS. Stage 1 builds two semantically anchored base distributions from the target prompt $\mathbf{p}_t$ and its clean counterpart $\mathbf{p}_c$, then mixes them into a parameterized prompt distribution $\mathbf{p}_\alpha$. Stage 2 performs black-box policy optimization: samples from $\mathbf{p}_\alpha$ are evaluated by the T2I model, and the bypass/NSFW feedback updates the mixing policy $\alpha$.
Figure 3: Qualitative results of JANUS on Stable Diffusion XL (left) and Midjourney (right).
Figure 4: Effect of the mixing policy $\alpha$ on jailbreak performance for SD3.5LT (left) and DALL$\cdot$E3 (right). The left y-axis reports TASR / IASR / ASR (%), while the right y-axis reports the NSFW score. Fixing $\alpha$ to any static value leads to a suboptimal trade-off between filter evasion and content harmlessness. Our full framework ("Fully Trained") uses RL to learn a dynamic $\alpha$ policy, achieving superior overall jailbreak performance.
Figure 5: More qualitative results of JANUS on Stable Diffusion 3.5 Large Turbo (left) and DALL$\cdot$E3 (right).
...and 3 more figures
