Empirical Analysis of Decoding Biases in Masked Diffusion Models

Pengcheng Huang, Tianming Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Tong Xiao, Zulong Chen, Maosong Sun

TL;DR

The paper investigates decoding biases in Masked Diffusion Models (MDMs) and identifies two key problems under uncertainty-driven decoding: rigid boundary-first trajectories and overemphasis on trivial tokens. It introduces Uncode, a lightweight decoding-calibration framework that uses a Positional Trajectory Prior and a Semantic Informativeness Prior to adjust unmasking priorities during inference. Empirical results across seven benchmarks and multiple backbones show that Uncode yields substantial performance gains (more than 7% on average) and is competitive with autoregressive models, while retaining the efficiency gains of existing decoding strategies. The findings show how controlling decoding order can unlock the reasoning and planning capabilities of MDMs, with practical impact on deploying high-quality non-autoregressive language generation. The approach is robust to the choice of calibration corpus and can be integrated with various efficient decoding methods, offering a practical path to faster, more reliable MDM deployments.
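
To make the idea of "adjusting unmasking priorities" concrete, here is a minimal Python sketch. It is not Uncode's actual formulation, which this summary does not give. The function name `calibrated_unmask_order`, the linear position penalty, the fixed trivial-token penalty, and the weights `alpha` and `beta` are all assumptions chosen for illustration. The sketch only shows the general pattern: start from the model's confidence for each masked position, add prior terms, and unmask in order of the adjusted score.

```python
import math


def calibrated_unmask_order(confidences, tokens, trivial_tokens,
                            alpha=1.0, beta=1.0):
    """Rank masked positions by log-confidence plus two illustrative priors.

    All prior shapes here are hypothetical stand-ins, not the paper's method.
    """
    n = len(confidences)
    scores = []
    for i, (conf, tok) in enumerate(zip(confidences, tokens)):
        # Assumed positional prior: a linear penalty that grows with position,
        # so late positions (e.g. end-of-sequence) are not unmasked first.
        pos_prior = -alpha * i / max(n - 1, 1)
        # Assumed informativeness prior: a fixed penalty on trivial tokens.
        info_prior = -beta if tok in trivial_tokens else 0.0
        scores.append(math.log(conf) + pos_prior + info_prior)
    # Highest adjusted score is unmasked first.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


# Toy example: the model is most confident about "." and "<eos>".
confidences = [0.6, 0.5, 0.9, 0.95]
tokens = ["x", "=", ".", "<eos>"]
trivial = {".", "<eos>"}

# Pure confidence ordering (both priors switched off).
print(calibrated_unmask_order(confidences, tokens, trivial, alpha=0.0, beta=0.0))
# -> [3, 2, 0, 1]

# With the illustrative priors, content tokens are unmasked first.
print(calibrated_unmask_order(confidences, tokens, trivial))
# -> [0, 1, 2, 3]
```

In this toy case, plain confidence ranking unmasks the end-of-sequence and punctuation tokens first, which mirrors the boundary-first and trivial-token biases described above. The added prior terms push those positions to the end of the order.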

Abstract

Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals a Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers use floating tokens to build a global structural framework, while deeper layers allocate more capacity to capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance of ARMs on knowledge-intensive tasks. All code is available at https://github.com/NEUIR/Uncode.

Paper Structure

This paper contains 26 sections, 16 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Illustration of decoding biases during reasoning. Uncertainty-based decoding prioritizes (A) rigid boundaries and (B) trivial tokens, causing systematic deviations from optimal problem-solving paths. In contrast, panel (C) presents the ideal unmasking pattern for producing coherent logical chains to achieve reliable inference.
  • Figure 2: Visualization of the Rigid Boundary Bias and its impact on downstream performance. (a) Unmasking probability for each token position across decoding steps on GSM8K, with both sequence length and decoding steps set to 256, where darker blue intensities denote higher unmasking probabilities. (b) Accuracy comparison of different decoding strategies on GSM8K (reasoning) and Sudoku (planning).
  • Figure 3: Verification of trivial token bias and trivial-token suppression efficacy. (a) The proportion of trivial tokens unmasked at each decoding step under the uncertainty-based sampler consistently exceeds the AR baseline (Qwen-2.5-7B-Instruct, dashed line). (b) GSM8K accuracy improves monotonically with suppression probability $p$.
  • Figure 4: Ablation results of individual modules on LLaDA-8B-Instruct and LLaDA-1.5-8B, reporting the average performance across all evaluation benchmarks.
  • Figure 5: Analysis of uncertainty dynamics during decoding. (a) Answer Entropy vs. Decoding Step. While the baseline (Confidence, blue) often prematurely unmasks answer tokens with high uncertainty, Uncode (red) delays these answer tokens to later steps (bottom-right cluster), ensuring they are generated with high confidence. (b) Uncode achieves faster global uncertainty reduction than the baseline.
  • ...and 12 more figures