DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models

Xueyu Zhou, Yangrong Hu, Jian Huang

Abstract

Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.

Paper Structure (33 sections, 14 equations, 5 figures, 9 tables, 1 algorithm)

Figures (5)

  • Figure 1: Accuracy on GSM8K using LLaDA-Instruct-8B [nie2025large] with a fixed generation length of 512 tokens. Block size 32 corresponds to block-wise decoding, while 512 represents the single-block setting. Existing methods (Fast-dLLM [wu2025fast], KLASS [kim2025klass], Confidence [chang2022maskgit]) degrade under large block sizes, whereas DOS (ours) remains consistent and robust across both settings.
  • Figure 2: A toy example of parallel decoding in MDLMs, where $C$ denotes the prompt or unmasked tokens and $\{X_i\}_{i=1}^4$ are masked tokens. Different factorizations of $p(X_1,X_2,X_3,X_4 \mid C)$ correspond to different parallel decoding orders. Only factorizations that respect the underlying dependency structure can recover the true joint distribution; improper independence assumptions lead to distribution mismatch.
  • Figure 3: Accuracy on HumanEval using LLaDA-Instruct-8B with a fixed generation length of 256 under varying block sizes. Existing decoding strategies are sensitive to block size and degrade as the block size increases, whereas DOS (ours) demonstrates strong robustness to block size variation and maintains superior consistency across all settings.
  • Figure 4: Accuracy of DOS and DOS+EB on the HumanEval benchmark using the LLaDA-Instruct-8B model, where the x-axis indicates the transformer layer from which the attention matrix is extracted.
  • Figure 5: Accuracy on HumanEval using Dream-v0-Instruct-7B with a fixed generation length of 256 under varying block sizes. Existing decoding strategies are sensitive to block size and degrade as the block size increases, whereas DOS (ours) demonstrates strong robustness to block size variation and maintains superior consistency across all settings.
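
The distribution mismatch described in the Figure 2 caption can be illustrated numerically. The sketch below is not taken from the paper: it assumes a hypothetical pair of fully correlated binary tokens (two rather than the four in the figure, for brevity) and omits the conditioning context $C$. It compares the true joint with two factorizations: a product of marginals, which is what decoding both positions independently in one step amounts to, and a chain-rule factorization that respects the dependency.

```python
from itertools import product

# Hypothetical toy joint over two masked tokens (X1, X2), not from the paper:
# the two tokens always agree, so they are fully dependent.
joint = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}


def marginal(dist, idx):
    """Marginal distribution of the token at position idx."""
    m = {0: 0.0, 1: 0.0}
    for x, p in dist.items():
        m[x[idx]] += p
    return m


def product_of_marginals(dist):
    """Independent parallel decoding: p(X1) * p(X2)."""
    m1, m2 = marginal(dist, 0), marginal(dist, 1)
    return {(a, b): m1[a] * m2[b] for a, b in product((0, 1), repeat=2)}


def chain_rule(dist):
    """Dependency-respecting factorization: p(X1) * p(X2 | X1)."""
    m1 = marginal(dist, 0)
    out = {}
    for (a, b), p in dist.items():
        cond = p / m1[a] if m1[a] > 0 else 0.0  # p(X2 = b | X1 = a)
        out[(a, b)] = m1[a] * cond
    return out


def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


independent = product_of_marginals(joint)
sequential = chain_rule(joint)
print(total_variation(joint, independent))  # mismatch from the independence assumption
print(total_variation(joint, sequential))   # zero: the chain rule recovers the joint
```

In this toy case the independent factorization puts probability 0.25 on each of the four outcomes, including the two that the true joint never produces, while the chain-rule factorization reproduces the joint exactly.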

Theorems & Definitions (3)

  • Definition 1: Confidence
  • Definition 2: Entropy
  • Definition 3: Margin confidence
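
This overview lists the three definitions by name only, so the paper's exact formulations are not reproduced here. As a minimal sketch, assuming the commonly used forms of these token-level uncertainty scores (maximum probability, Shannon entropy, and the gap between the two most likely tokens), they can be computed from a single position's predicted distribution as follows. The function names and the example distribution are illustrative, not the authors'.

```python
import math


def confidence(probs):
    """Assumed form: probability of the most likely token."""
    return max(probs)


def entropy(probs):
    """Assumed form: Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin_confidence(probs):
    """Assumed form: gap between the top-1 and top-2 token probabilities."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


# Hypothetical predictive distribution over a 3-token vocabulary at one masked position.
probs = [0.7, 0.2, 0.1]
scores = {
    "confidence": confidence(probs),
    "entropy": entropy(probs),
    "margin": margin_confidence(probs),
}
print(scores)
```

Each score depends only on the distribution at a single position, which is the sense in which the abstract describes existing criteria as token-level.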