DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking

Gilad Turok, Chris De Sa, Volodymyr Kuleshov

TL;DR

DUEL formalizes deterministic position selection and gives MDMs proper perplexity for the first time. It also enables the first principled comparison of fast, parallel samplers across compute budgets, an analysis impossible with the ELBO and unreliable with generative perplexity.

Abstract

Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper perplexity evaluation: the ELBO is a loose bound on likelihood under the training distribution, not the test-time distribution, while generative perplexity requires a biased external model and ignores diversity. To address this, we introduce the \textsc{DUEL} framework, which formalizes \emph{deterministic} position selection, unifying leading MDM sampling strategies. We prove \textbf{\textsc{DUEL} admits \emph{exact} likelihood computation} via a simple algorithm, evaluated under the same position selection used at test time. This \textbf{gives MDMs proper perplexity for the first time} -- the natural analogue of autoregressive perplexity. With proper perplexity in hand, we revisit key questions about MDMs. \textbf{MDMs are substantially better than previously thought}: the MDM-autoregressive perplexity gap shrinks by up to 32\% on in-domain data and 82\% on zero-shot benchmarks. \textsc{DUEL} enables the first principled comparison of fast, parallel samplers across compute budgets -- an analysis impossible with the ELBO and unreliable with generative perplexity -- identifying probability margin \citep{kim2025train} as a strong default. Finally, oracle search over position orderings reveals MDMs can far surpass autoregressive models -- achieving 36.47 vs.\ 52.11 perplexity on AG News -- demonstrating the ceiling of MDM performance has not yet been reached.
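
The abstract's claim of exact likelihood under deterministic position selection can be illustrated with a minimal sketch. It assumes, as one reading of the abstract, that the unmasking rule is a deterministic function of the current partially masked sequence and the denoiser's predictions, so a given sequence has exactly one unmasking trajectory and its likelihood is the product of the token probabilities along it. The names `toy_denoiser`, `left_to_right`, `top_confidence`, and `exact_log_likelihood` are hypothetical and not taken from the paper, and the toy denoiser stands in for a trained network.

```python
import math
from itertools import product

MASK = None          # hypothetical marker for a masked position
VOCAB = ("a", "b")   # toy vocabulary, for illustration only


def toy_denoiser(seq):
    """Hypothetical order-sensitive denoiser (not the paper's network).

    Returns {position: {token: prob}} for every masked position. Each revealed
    token pulls the prediction toward itself, so revealed context matters.
    """
    n_a = sum(t == "a" for t in seq)
    n_b = sum(t == "b" for t in seq)
    p_a = (1 + 2 * n_a) / (2 + 2 * n_a + 2 * n_b)
    return {i: {"a": p_a, "b": 1 - p_a} for i, t in enumerate(seq) if t is MASK}


def left_to_right(seq, probs):
    """Deterministic rule: always unmask the leftmost masked position."""
    return [min(probs)]


def top_confidence(seq, probs):
    """Deterministic rule: unmask the most confident position (ties: leftmost)."""
    return [max(probs, key=lambda i: (max(probs[i].values()), -i))]


def exact_log_likelihood(x, denoiser, rule):
    """Replay the unique trajectory that generates x and sum its log-probs."""
    seq = [MASK] * len(x)
    log_p = 0.0
    while any(t is MASK for t in seq):
        probs = denoiser(seq)
        chosen = rule(seq, probs)      # the rule sees only the revealed context
        for i in chosen:
            log_p += math.log(probs[i][x[i]])
        for i in chosen:
            seq[i] = x[i]              # reveal the true tokens, then repeat
    return log_p


def perplexity(x, denoiser, rule):
    """Per-token perplexity of one sequence under the induced distribution."""
    return math.exp(-exact_log_likelihood(x, denoiser, rule) / len(x))


# Sanity check: the likelihoods form a proper distribution over sequences.
total = sum(
    math.exp(exact_log_likelihood(x, toy_denoiser, top_confidence))
    for x in product(VOCAB, repeat=3)
)
print(round(total, 6))
print(perplexity(("a", "a", "b"), toy_denoiser, left_to_right))
```

Because the rule is replayed on the evaluated sequence exactly as it would run during sampling, the quantity computed is the likelihood under the test-time sampler rather than a bound. The sketch makes one denoiser call per unmasking step.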

Paper Structure

This paper contains 97 sections, 7 theorems, 41 equations, 5 figures, 8 tables, 2 algorithms.

Key Result

Theorem 4.2

If the denoising network $x_\theta$ is order-sensitive, meaning its predictions at masked positions depend on revealed context, then there exist unmasking rules $F_1, F_2$ inducing different distributions: $p_\theta^{\pi^{F_1}} \neq p_\theta^{\pi^{F_2}}$. (Proof in the appendix; see app:duel-properties.)
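
A small numerical check may make the statement concrete. The sketch below is an illustration under stated assumptions, not the paper's proof: it uses a hypothetical two-position, binary-vocabulary denoiser whose prediction depends on the revealed neighbour, and takes $F_1$ and $F_2$ to be the fixed left-to-right and right-to-left orders. The names `denoiser` and `induced_distribution` and all probabilities are invented for the example.

```python
from itertools import product


def denoiser(seq, pos):
    """Hypothetical order-sensitive denoiser over the vocabulary {0, 1}.

    With no context, position 0 prefers token 1 strongly and position 1 is
    uniform. Once the other position is revealed, the prediction copies it
    with probability 0.8.
    """
    other = seq[1 - pos]
    if other is None:
        p1 = 0.9 if pos == 0 else 0.5
    else:
        p1 = 0.8 if other == 1 else 0.2
    return {0: 1 - p1, 1: p1}


def induced_distribution(order):
    """Distribution over length-2 sequences when unmasking in a fixed order."""
    dist = {}
    for x in product((0, 1), repeat=2):
        seq = [None, None]
        p = 1.0
        for pos in order:
            p *= denoiser(seq, pos)[x[pos]]
            seq[pos] = x[pos]
        dist[x] = p
    return dist


p_f1 = induced_distribution((0, 1))  # F_1: left-to-right
p_f2 = induced_distribution((1, 0))  # F_2: right-to-left
for x in p_f1:
    print(x, round(p_f1[x], 3), round(p_f2[x], 3))
```

Both orders give valid distributions, yet they assign different probabilities to the same sequence; for example, the sequence (1, 1) gets 0.9 × 0.8 = 0.72 under left-to-right and 0.5 × 0.8 = 0.40 under right-to-left.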

Figures (5)

  • Figure 1: One step of MDM generation. Given a partially masked sequence, the unmasking policy $\pi$ performs position selection (choosing position 2), and the denoising network $p_\theta$ performs token prediction at that position (sampling "big"). This repeats until no masked positions remain.
  • Figure 2: Unmasking trajectory for $\sigma = (\{1,3\}, \{2\}, \{4\})$. Starting from a fully masked sequence ($t{=}0$), positions are progressively revealed according to the ordered partition. At $t{=}1$, positions 1 and 3 are unmasked in parallel; subsequent steps unmask one position each.
  • Figure 3: DUEL: Sampling
  • Figure 4: Comparing fast samplers. Top: DUEL perplexity. Bottom: Generative perplexity. Dashed line: ELBO ($\leq$23.52). DUEL yields consistent rankings across NFE (probability margin best at low NFE, convergence at high NFE). Generative perplexity rankings cross repeatedly, making it unreliable.
  • Figure 5: Comparing fast samplers across four metrics. Dashed line in DUEL perplexity: ELBO ($\leq$23.52). DUEL perplexity yields consistent rankings across NFE (probability margin best at low NFE), while generative perplexity rankings cross repeatedly. Entropy reveals that left-to-right produces degenerate low-entropy text at low NFE despite achieving favorable generative perplexity. MAUVE saturates near zero at low NFE for all rules, providing little discriminative signal.
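
The unmasking trajectory in the Figure 2 caption can be replayed with a short sketch. It assumes 1-indexed positions as in the caption; the four-token sentence, the `[MASK]` string, and the function name `unmasking_trajectory` are hypothetical choices made for illustration.

```python
def unmasking_trajectory(tokens, sigma, mask="[MASK]"):
    """Return the sequence states visited while revealing tokens block by block.

    `sigma` is an ordered partition of the 1-indexed positions: a tuple of
    disjoint sets that together cover every position exactly once.
    """
    length = len(tokens)
    flat = sorted(i for block in sigma for i in block)
    if flat != list(range(1, length + 1)):
        raise ValueError("sigma must partition positions 1..len(tokens)")

    state = [mask] * length
    states = [list(state)]             # t = 0: fully masked
    for block in sigma:
        for i in block:                # positions in one block are revealed in parallel
            state[i - 1] = tokens[i - 1]
        states.append(list(state))
    return states


tokens = ["the", "big", "dog", "barks"]   # hypothetical example sentence
sigma = ({1, 3}, {2}, {4})                # ordered partition from the Figure 2 caption
states = unmasking_trajectory(tokens, sigma)
for t, s in enumerate(states):
    print(t, " ".join(s))
```

With this partition the sequence is fully revealed after three steps, one per block, which is fewer than the four steps a one-position-at-a-time order would need.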

Theorems & Definitions (17)

  • Definition 3.1: Unmasking Rule
  • Definition 3.2: DUEL Sampler
  • Definition 4.1: Ordered Partition
  • Theorem 4.2: Policy-Dependent Distribution
  • Theorem 4.3: DUEL Exact Likelihood
  • Proposition 3.2: Joint Factorization
  • proof
  • Proposition 3.3: Uniform Proposal Distribution
  • proof
  • Proposition 3.4: ELBO Training Objective
  • ...and 7 more