Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position

Zhixin Xie, Xurui Song, Jun Luo

TL;DR

This work investigates safety in diffusion LLMs (dLLMs) and identifies a security asymmetry specific to them: middle tokens are the most critical for safety, while attackers have limited ability to influence them because of dLLMs' bias toward sequential generation. Leveraging this, the authors propose Middle-tOken Safety Alignment (MOSA), a reinforcement-learning-based method that explicitly aligns middle-token generation with predefined safe refusals within a window spanning roughly tokens $20$ to $60$, and adds a KL-divergence penalty to preserve utility. MOSA is implemented on a LLaDA-family model with LoRA and evaluated against eight jailbreak attacks on AdvBench and HarmBench; it substantially reduces attack success rates while maintaining near-original performance on coding, math, and reasoning tasks. The results demonstrate an architecture-aware defense for dLLMs that outperforms traditional initial-token defenses and generalizes across multiple models, with practical implications for deploying safer diffusion-based LLMs in real-world settings.

Abstract

Diffusion Large Language Models (dLLMs) have recently emerged as a competitive non-autoregressive paradigm due to their unique training and inference approach. However, safety studies of this novel architecture are currently lacking. In this paper, we present the first analysis of dLLMs' safety performance and propose a novel safety alignment method tailored to their unique generation characteristics. Specifically, we identify a critical security asymmetry between the defender and the attacker. For the defender, we reveal that the middle tokens of the response, rather than the initial ones, are more critical to the overall safety of dLLM outputs; this suggests that aligning middle tokens can be more beneficial to the defender. The attacker, by contrast, may have limited power to manipulate middle tokens: we find that dLLMs have a strong tendency towards a sequential generation order in practice, which forces the attack to conform to this distribution and diverts it from influencing the critical middle tokens. Building on this asymmetry, we introduce Middle-tOken Safety Alignment (MOSA), a novel method that uses reinforcement learning to directly align the model's middle generation with safe refusals. We implement MOSA and compare its security performance against eight attack methods on two benchmarks. We also test the utility of a MOSA-aligned dLLM on coding, math, and general reasoning. The results strongly support the superiority of MOSA.
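As a rough illustration of the middle-window objective summarized in the TL;DR, the sketch below scores a response by how much of a predefined refusal appears in a middle window and subtracts a KL-divergence penalty against a reference model. It is not the authors' implementation: the function names, the token-overlap reward, the half-open window `[20, 60)`, and the weight `beta` are all hypothetical choices made for illustration only.

```python
import math

def middle_window_reward(response_tokens, refusal_tokens, start=20, end=60):
    """Hypothetical reward: fraction of refusal tokens found in response_tokens[start:end]."""
    if not refusal_tokens:
        return 0.0
    window = set(response_tokens[start:end])  # only the middle window is scored
    hits = sum(1 for tok in refusal_tokens if tok in window)
    return hits / len(refusal_tokens)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def penalized_reward(response_tokens, refusal_tokens,
                     policy_probs, reference_probs, beta=0.1):
    """Middle-window reward minus a KL penalty that discourages drift from the reference model.

    beta is an assumed penalty weight, not a value taken from the paper.
    """
    reward = middle_window_reward(response_tokens, refusal_tokens)
    penalty = kl_divergence(policy_probs, reference_probs)
    return reward - beta * penalty

if __name__ == "__main__":
    refusal = ["I", "cannot", "help", "with", "that"]
    filler = ["x"] * 80
    middle_refusal = filler[:20] + refusal + filler[25:]  # refusal placed at position 20
    front_refusal = refusal + filler[5:]                  # refusal placed at position 0
    print(middle_window_reward(middle_refusal, refusal))
    print(middle_window_reward(front_refusal, refusal))
```

Under these assumptions, a refusal placed at the start of the response earns no reward while the same refusal inside the window earns full reward, which mirrors the idea of aligning the middle rather than the initial tokens.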

Paper Structure

This paper contains 32 sections, 1 equation, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: The security asymmetry in AR vs. dLLM architectures. In AR models, attackers and defenders compete for control over the same initial tokens. In dLLMs, the attacker's influence is restricted to the sequence start, potentially allowing the defender to strategically align the middle tokens.
  • Figure 2: An example of prefilling.
  • Figure 3: Prefilling position vs. attack performance.
  • Figure 4: The loss of optimization of different tokens.
  • Figure 5: The generation preferences of dLLMs.
  • ...and 2 more figures