The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen, Jiashu Qu, Zhaorun Chen, Xiaoya Lu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang

TL;DR

This work presents DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs, and demonstrates that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures.

Abstract

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits the model's dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need to rethink safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

Paper Structure

This paper contains 39 sections, 6 equations, 9 figures, 20 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of practical applications enabled by interleaved mask-text prompting in dLLMs, including instruction editing, formatted generation, and structured information extraction.
  • Figure 2: Comparison of PAIR and DiJA on LLaDA-1.5. While PAIR is blocked by a safety response, DiJA bypasses safeguards via interleaved mask-text jailbreak prompts.
  • Figure 3: Comparison of the defensive capabilities of diffusion-based and autoregressive LLMs across three jailbreak benchmarks: (a) under the AIM attack (to avoid missing bars due to zero values, all ASR-k scores are uniformly offset by +5%), and (b) under the PAIR attack. Additional experimental results can be found in the Appendix.
  • Figure 4: Illustrative cases of harmful completions generated by four dLLMs when attacked by DiJA. The red text represents harmful content generated by dLLMs under DiJA attack.
  • Figure 5: Evaluator-based attack success rate (ASR-e) or StrongREJECT score (SRS) of jailbreak attacks under two defense mechanisms on three victim dLLMs across multiple jailbreak benchmarks.
  • ...and 4 more figures