DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li, Zheng Nie, Zhenhong Zhou, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Yufei Guo, Jiaheng Zhang

Abstract

The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
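The abstract describes Stochastic Annealing Remasking only at a high level: controlled randomness is added to the greedy (highest-confidence) remasking choice. The following is a minimal sketch of that idea, not the authors' implementation. The linear decay schedule, the mixing rule `(1 - alpha) * confidence + alpha * noise`, the default `alpha0=0.3`, and the function names `annealed_alpha` and `select_positions` are all assumptions made for illustration.

```python
import random


def annealed_alpha(step, total_steps, alpha0=0.3):
    """Randomness weight for a 0-indexed step.

    Assumed schedule: linear decay from alpha0 at the first step
    to 0 at the last step, so late steps behave greedily.
    """
    if total_steps <= 1:
        return 0.0
    return alpha0 * (1.0 - step / (total_steps - 1))


def select_positions(confidences, k, step, total_steps, alpha0=0.3, rng=None):
    """Pick k masked positions to keep (unmask) at this step.

    confidences: dict mapping position -> model confidence in [0, 1].
    With alpha0 = 0 this reduces to plain greedy selection by confidence.
    The mixing rule below is an illustrative assumption.
    """
    rng = rng or random.Random()
    alpha = annealed_alpha(step, total_steps, alpha0)
    scores = {
        pos: (1.0 - alpha) * conf + alpha * rng.random()
        for pos, conf in confidences.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    conf = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.1}
    # Greedy baseline: no randomness at any step.
    print(select_positions(conf, k=2, step=0, total_steps=64, alpha0=0.0))
    # Early step with randomness: the selection may differ from greedy.
    print(select_positions(conf, k=2, step=0, total_steps=64, alpha0=0.3,
                           rng=random.Random(0)))
```

Under this assumed schedule, randomness is strongest in the early steps, where the abstract says token safety most influences the final output, and vanishes by the last step so the final selection is purely confidence-based.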

Paper Structure

This paper contains 38 sections, 10 equations, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: Left. The generation diagram of dLLMs; Middle. The unique vulnerabilities of dLLMs at the intra-step and inter-step levels; Right. The DiffuGuard framework achieves significant safety improvements with minimal impact on model performance and inference latency.
  • Figure 2: Safety Capabilities of LLaDA under Different Scenarios. The analysis is based on the first 3 generation steps, focusing on the first 8 token positions of the output sequence. (a)(b)(c) respectively show the logits for safe, malicious, and jailbreak queries, which are visualized as heatmaps at the output layer. (d) represents the token distribution at Layer 27 under a jailbreak query.
  • Figure 3: Impact of randomness in remask strategies on the safety-quality trade-off.
  • Figure 4: Effect of Initial Tokens on dLLM ASR. We compare the final safety performance when guiding generation with unsafe tokens (e.g., "Sure") versus safe tokens (e.g., "Sorry"), benchmarked against various baseline methods.
  • Figure 5: ASR as a Function of the Safe Token Injection Step. The experiment was conducted over 64 generation steps, where we forcibly set the first position to "Sorry" at various steps (1, 2, 4, 8, 16, and 32) and recorded the final ASR.
  • ...and 6 more figures