Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu

TL;DR

We propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs.

Abstract

Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
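
The multi-token parallel updates mentioned above can be pictured with a toy unmasking loop. The sketch below is only an illustration under stated assumptions, not the authors' parallel-forced decoding strategy: `parallel_unmask`, `confidence_fn`, and `tokens_per_step` are hypothetical names, and the confidence function is a stand-in for a model's per-position confidence.

```python
def parallel_unmask(confidence_fn, length, tokens_per_step):
    """Toy schedule: which positions get unmasked at each decoding step.

    confidence_fn(pos, masked) is a hypothetical stand-in for a model's
    confidence at a still-masked position; higher means "commit sooner".
    """
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    masked = set(range(length))
    schedule = []
    while masked:
        # Score every still-masked position, then commit the top-k together.
        scores = {pos: confidence_fn(pos, masked) for pos in masked}
        ranked = sorted(masked, key=lambda p: (-scores[p], p))
        chosen = ranked[:tokens_per_step]
        schedule.append(sorted(chosen))
        masked -= set(chosen)
    return schedule


# A left-biased confidence mimics AR-like behavior: earlier positions win.
left_biased = lambda pos, masked: -pos
print(parallel_unmask(left_biased, 6, 2))  # [[0, 1], [2, 3], [4, 5]]
```

With a left-biased confidence and `tokens_per_step=1`, the loop degenerates into strict left-to-right decoding over `length` steps; a larger `tokens_per_step` cuts the step count to `ceil(length / tokens_per_step)`.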

Paper Structure (27 sections, 5 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Visualization of decoding dynamics. We plot the token position being unmasked (y-axis) against the decoding step (x-axis). (a, b) Despite using confidence-based Arbitrary Order (AO) decoding, standard DLMs (LLaDA and Dream) exhibit a strict linear diagonal pattern, revealing that their behavior collapses into autoregressive (left-to-right) generation. (c) Random decoding eliminates AR bias but lacks structure. (d) Our method (NAP) breaks the single-stream bottleneck, generating multiple reasoning trajectories simultaneously.
  • Figure 2: Performance on GSM8K (left) and MATH-500 (right). Forcing low-ARness behavior (Random decoding) generally causes reasoning performance to collapse. For LLaDA, we employ a constrained block-wise decoding strategy to ensure generation validity. Because this preserves local structural integrity, Arbitrary Order (AO) decoding maintains comparable performance, unlike the sharp drop observed under fully unstructured random decoding.
  • Figure 3: Sequential Dependence (SeqDep) Analysis on (a) OpenR1-Math and (b) FineWeb Datasets. The consistently high and rising SeqDep scores indicate that standard training corpora possess strong intrinsic sequentiality, driving models to internalize AR-like dependencies.
  • Figure 4: Long-CoT Supervision Increases ARness. The positive deltas show models converging toward strict left-to-right generation (1.0), confirming that current supervision methods actively discourage non-autoregressive parallel decoding.
  • Figure 5: A compact training instance. The model generates parallel paths (including distinct methods and a noisy path) and aggregates them into a correct summary.
  • ...and 2 more figures
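
The captions for Figures 1 and 4 describe how closely an unmasking order follows left-to-right generation, with 1.0 denoting strict left-to-right. The paper's exact ARness definition is not reproduced in this excerpt, so the following is a hypothetical proxy rather than the authors' metric: `leftmost_first_rate` (an assumed name) reports the fraction of decoding steps that unmask the leftmost position still masked.

```python
def leftmost_first_rate(order):
    """Fraction of steps that unmask the leftmost still-masked position.

    `order` lists token positions in the order they were unmasked, one
    position per step, with no repeats. This is an illustrative proxy
    only, not the ARness metric defined in the paper.
    """
    if not order:
        raise ValueError("order must contain at least one position")
    remaining = set(order)
    hits = 0
    for pos in order:
        if pos == min(remaining):
            hits += 1
        remaining.remove(pos)
    return hits / len(order)


print(leftmost_first_rate([0, 1, 2, 3]))              # 1.0, strict left-to-right
print(leftmost_first_rate([3, 2, 1, 0]))              # 0.25, right-to-left
print(leftmost_first_rate([0, 4, 1, 5, 2, 6, 3, 7]))  # 0.625, two interleaved streams
```

Under this proxy, the diagonal pattern in Figure 1 (a, b) would score at or near 1.0, while an order that advances several streams at once scores lower.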