Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models

Luhan Tang, Longxuan Yu, Shaorong Zhang, Greg Ver Steeg

TL;DR

Few-step discrete diffusion samplers are not distributionally correct even under an oracle denoiser: their transition-level mismatch vanishes only as the number of steps approaches the sequence length.

Abstract

Discrete diffusion language models (dLLMs) provide a fast and flexible alternative to autoregressive models (ARMs) via iterative denoising with parallel updates. However, their evaluation is challenging: existing metrics conflate denoiser approximation error with sampler-induced error from the sampling dynamics, a problem that does not arise for ARMs whose autoregressive sampling exactly reflects the learned probability model. We introduce a sampler-centric oracle framework that replaces learned denoisers with an exact Hidden Markov Model posterior derived from a ground-truth Markov chain, isolating sampler-induced error in a controlled setting. We show that few-step discrete diffusion samplers are not distributionally correct even under an oracle denoiser, with transition-level mismatch that vanishes only as the number of steps approaches the sequence length. Moreover, improvements in negative log-likelihood, generative perplexity, or MAUVE do not imply correct sampling. Code is available at https://luhantang.github.io/dllm_sampler
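
To make the central claim concrete, the following toy sketch compares a one-step parallel sampler with a one-token-per-step sampler when both use an exact posterior. It is an illustration under stated assumptions, not the paper's code: the 3-state chain `P`, the length-2 sequence, and the KL direction in `transition_kl` are all hypothetical choices made here, and the general forward-backward computation of the HMM posterior is omitted because it is trivial when every position is masked.

```python
import numpy as np

# Hypothetical 3-state ground-truth Markov chain (each row sums to 1).
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
pi = np.full(3, 1.0 / 3.0)  # uniform is stationary because P is doubly stochastic

def transition_kl(joint, P):
    """E_{x1}[ KL( q(. | x1) || P(. | x1) ) ] for a joint q(x1, x2).

    Assumption: this direction of KL is chosen for illustration only.
    """
    m1 = joint.sum(axis=1, keepdims=True)  # marginal of x1 under q
    q_cond = joint / m1                    # q(x2 | x1)
    return float(np.sum(joint * (np.log(q_cond) - np.log(P))))

# One step, both tokens unmasked in parallel: with everything masked, the exact
# posterior marginal at each position is the chain marginal, and sampling the
# positions independently yields a product distribution.
joint_parallel = np.outer(pi, pi @ P)

# Two steps (steps == sequence length): unmask x1, then sample x2 from the
# exact posterior given x1, which is the row P[x1].
joint_sequential = pi[:, None] * P

kl_parallel = transition_kl(joint_parallel, P)
kl_sequential = transition_kl(joint_sequential, P)
print(f"transition KL, 1 step : {kl_parallel:.4f}")
print(f"transition KL, 2 steps: {kl_sequential:.4f}")
```

In this toy setting the denoiser is exact in both cases, so any nonzero transition KL comes from the parallel update alone; it disappears once the number of steps equals the sequence length.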

Paper Structure (77 sections, 102 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: Transition-level metrics on OpenWebText (OWT) under an oracle denoiser. We report transition KL, NLL, entropy rate, and $n$-gram diversity as functions of the number of sampling steps for several discrete diffusion samplers, using an exact oracle posterior to isolate sampler-induced error. For all samplers, substantial transition-level error persists at small step counts, with convergence to the autoregressive baseline occurring only when the number of steps approaches the sequence length $T$ (except LLaDA). Notably, LLaDA attains very low sequence NLL at few steps, despite severe degradation in transition KL, entropy, and diversity, demonstrating that NLL alone is not a reliable indicator of distributional correctness. All samplers show zero exact duplication.
  • Figure 2: Temperature-induced score sharpening in SEDD under an oracle denoiser (OWT). As $\beta$ increases, transition KL rises and 3-gram diversity falls, while sequence NLL and GenPPL decrease. MAUVE drops sharply at low temperatures, indicating severe distributional degradation. This reveals a misalignment between likelihood-based metrics and transition-level correctness. Duplication remains negligible except under extreme sharpening (Appendix \ref{app:sedd_extreme}).
  • Figure 3: GenPPL under controlled local sharpening in an autoregressive bigram generator (OWT). We introduce a sharpening factor $\beta$ ($\beta=1$ corresponds to accurate sampling) and evaluate fixed samples using a pretrained GPT-2 Large model. As $\beta$ increases, GenPPL decreases monotonically, while 3-gram diversity collapses and sentence entropy steadily declines, indicating increasing concentration of probability mass. Exact duplication remains negligible until extreme sharpening.
  • Figure 4: Step-wise evaluation of oracle ReMDM variants on OWT. Across transition-level metrics, the ReMDM-loop sampler exhibits larger deviations from the AR baseline than ReMDM-conf, indicating higher sampler error, while ReMDM-conf remains consistently closer to the oracle transition kernel. Applying nucleus sampling reduces transition KL, NLL, and entropy, indicating improved alignment with the oracle kernel, but decreases the support fraction due to truncation of low-probability tail transitions. External MAUVE scores remain high and vary only mildly across diffusion steps. A full breakdown of all reported metrics is provided in Appendix Figure \ref{fig:remdm_owt_allmetrics}.
  • Figure 5: Distribution of effective sparsity $k_i^{*}$ in OpenWebText (OWT) under the 99% cumulative mass criterion. The distribution is strongly right-skewed: most tokens require only a small number of successors, while a small fraction exhibit heavy-tailed behavior. The chosen global sparsity level $K=206$ corresponds to the 90th percentile of this distribution.
  • ...and 4 more figures
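
Two quantities from the captions above, the sharpening factor $\beta$ (Figures 2 and 3) and the effective sparsity $k_i^{*}$ (Figure 5), can be sketched on a single row of a transition matrix. The captions do not give exact formulas, so both definitions below are assumptions: sharpening is taken to be power scaling $p^{\beta}$ followed by renormalization, and $k_i^{*}$ is taken to be the smallest number of highest-probability successors whose cumulative mass reaches 99%. The example row is hypothetical.

```python
import numpy as np

def sharpen(row, beta):
    """Assumed sharpening: raise probabilities to the power beta and renormalize.

    beta = 1 leaves the row unchanged; beta > 1 concentrates probability mass.
    """
    w = np.asarray(row, dtype=float) ** beta
    return w / w.sum()

def effective_sparsity(row, mass=0.99):
    """Assumed k*: fewest top successors whose cumulative mass is >= `mass`."""
    sorted_p = np.sort(np.asarray(row, dtype=float))[::-1]  # descending order
    cum = np.cumsum(sorted_p)
    # searchsorted returns the first index whose cumulative mass is >= mass.
    return int(np.searchsorted(cum, mass) + 1)

# Hypothetical next-token distribution for one context token.
row = np.array([0.6, 0.3, 0.08, 0.015, 0.005])

k_exact = effective_sparsity(row)
k_sharp = effective_sparsity(sharpen(row, beta=2.0))
print("k* at beta=1:", k_exact)
print("k* at beta=2:", k_sharp)
```

Under these assumptions, sharpening shrinks the number of successors needed to cover 99% of the mass, which matches the qualitative picture in the captions of probability mass concentrating as $\beta$ grows.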