Empirical Analysis of Decoding Biases in Masked Diffusion Models

Pengcheng Huang; Tianming Liu; Zhenghao Liu; Yukun Yan; Shuo Wang; Tong Xiao; Zulong Chen; Maosong Sun

TL;DR

The paper investigates decoding biases in Masked Diffusion Models (MDMs) and identifies two key problems under uncertainty-driven decoding: rigid boundary-first trajectories and overemphasis on trivial tokens. It introduces Uncode, a lightweight decoding-calibration framework that uses a Positional Trajectory Prior and a Semantic Informativeness Prior to adjust unmasking priorities during inference. Empirical results across seven benchmarks and multiple backbones show that Uncode yields substantial performance gains (more than 7% on average) and makes MDMs competitive with autoregressive models while preserving the efficiency gains of existing decoding strategies. The findings show how controlling decoding order can unlock the reasoning and planning capabilities of MDMs, with practical impact on deploying high-quality non-autoregressive language generation. The approach is robust to the choice of calibration corpus and can be integrated with various efficient decoding methods, offering a practical path to faster, more reliable MDM deployments.
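To make the idea of adjusting unmasking priorities concrete, the following is a minimal sketch of confidence-first decoding with two additive priors. It is not the Uncode formulation, which this summary does not specify: the function name `next_unmask_position`, the additive log-score form, the weights `alpha` and `beta`, and all numeric values are assumptions chosen purely for illustration.

```python
import math

def next_unmask_position(confidence, masked, pos_prior=None, info_prior=None,
                         alpha=1.0, beta=1.0):
    """Return the masked index with the highest (optionally calibrated) score.

    confidence[i]: the model's top-token probability at position i, in (0, 1].
    masked:        indices that are still masked.
    pos_prior[i]:  hypothetical additive bonus for position i.
    info_prior[i]: hypothetical additive bonus for how informative the
                   predicted token at i is (low for punctuation and similar).
    With both priors omitted, this is plain confidence-first decoding.
    """
    def score(i):
        s = math.log(confidence[i])
        if pos_prior is not None:
            s += alpha * pos_prior[i]
        if info_prior is not None:
            s += beta * info_prior[i]
        return s

    return max(masked, key=score)

# Toy sequence of four masked positions (all values are made up).
# Positions 0 and 3 sit at the boundaries and hold trivial, high-confidence tokens.
confidence = [0.95, 0.60, 0.70, 0.97]
masked = [0, 1, 2, 3]
pos_prior = [0.0, 0.3, 0.3, -0.5]    # assumed: discourage the boundary-first habit
info_prior = [-0.5, 0.4, 0.6, -1.0]  # assumed: down-weight trivial tokens

baseline = next_unmask_position(confidence, masked)
calibrated = next_unmask_position(confidence, masked, pos_prior, info_prior)
print(baseline, calibrated)
```

In this toy setup the uncalibrated rule unmasks the boundary position with the highest raw confidence first, whereas the calibrated score moves a content-bearing interior position to the front of the queue. How the actual priors are estimated from a calibration corpus is outside the scope of this sketch.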

Abstract

Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates attention behavior in MDMs and reveals the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals a Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers use floating tokens to build a global structural framework, while deeper layers allocate more capacity to capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, which enable them to achieve double the performance of ARMs on knowledge-intensive tasks. All code is available at https://github.com/NEUIR/Uncode.
