On the $ε$-Free Inference Complexity of Absorbing Discrete Diffusion

Xunpeng Huang, Yingyu Lin, Nishant Jain, Kaibo Wang, Difan Zou, Yian Ma, Tong Zhang

TL;DR

The paper introduces Absorbing-Aware Truncated Uniformization (AATU) and proves that it achieves TV convergence with complexity independent of the error tolerance $ε$, thereby strictly outperforming existing uniform baselines. The analysis also eliminates the restrictive bounded-score assumption commonly required in prior studies of uniformization-based inference.

Abstract

Absorbing discrete diffusion has emerged as a dominant framework for discrete data generation. However, a significant disparity remains between its empirical success and theoretical understanding: existing analyses fail to demonstrate a complexity advantage over the $\mathcal{O}(d \ln(d/ε))$ baseline established for \emph{uniform} discrete diffusion. We bridge this gap by identifying a critical structural advantage: whereas uniform diffusion redundantly re-denoises valid elements, the absorbing scheme denoises each absorbing state exactly once. Leveraging this insight, we introduce \emph{Absorbing-Aware Truncated Uniformization} (AATU). We prove that AATU achieves $ε$-TV convergence with $\mathcal{O}(d \ln d)$ complexity, \emph{independent} of the error tolerance $ε$, thereby strictly outperforming existing uniform baselines. Beyond improving convergence rates, our analysis eliminates the restrictive bounded-score assumption commonly required in prior studies of uniformization-based inference. Furthermore, we extend AATU to time-invariant parameterizations, showing that it naturally adopts an imputation-type inference with a uniformly randomized denoising order. When combined with a lazy update strategy, TV convergence requires only $\mathcal{O}(d)$ discrete score evaluations. These results not only establish a rigorous foundation for absorbing discrete diffusion -- confirming its efficiency in high-accuracy generation -- but also open new avenues for analyzing diffusion-based language models under the masking paradigm.
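
The imputation-type inference mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the paper's AATU algorithm: it omits the uniformization machinery and the lazy update strategy, and only shows the structural point that each masked position is denoised exactly once, in a uniformly random order, so the number of score evaluations equals $d$. The names `MASK`, `toy_denoiser`, and `random_order_unmask` are hypothetical, and the uniform denoiser stands in for a learned conditional model.

```python
import random

MASK = -1  # hypothetical sentinel for the absorbing (mask) state


def toy_denoiser(x, i, vocab_size):
    """Hypothetical stand-in for a learned conditional p(x_i | unmasked tokens).

    A real model would condition on the unmasked entries of x; this
    placeholder ignores them and returns a uniform distribution.
    """
    return [1.0 / vocab_size] * vocab_size


def random_order_unmask(d, vocab_size, denoiser=toy_denoiser, seed=0):
    """Fill a fully masked length-d sequence, one position per step."""
    rng = random.Random(seed)
    x = [MASK] * d
    order = list(range(d))
    rng.shuffle(order)  # uniformly random denoising order
    nfe = 0  # number of denoiser (score) evaluations
    for i in order:
        probs = denoiser(x, i, vocab_size)
        nfe += 1
        # Sample a token for position i; it is never revisited afterwards.
        x[i] = rng.choices(range(vocab_size), weights=probs, k=1)[0]
    return x, nfe


if __name__ == "__main__":
    sample, nfe = random_order_unmask(d=4, vocab_size=3)
    print(sample, nfe)
```

Because every position leaves the mask state once and is never re-denoised, the loop performs exactly `d` evaluations regardless of any accuracy target, which mirrors the intuition behind the $ε$-free complexity claim.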

Paper Structure

This paper contains 44 sections, 21 theorems, 244 equations, 2 figures, 3 tables, 2 algorithms.

Key Result

Lemma 2.1

The probability mass function $q^\gets_t$ in the reverse process follows and the reverse transition function ${\bm{R}}^\gets_t$ arises as the infinitesimal operator of the reverse process: while the outgoing rate is $R^\gets_t({\bm{y}}^\prime) = \sum_{{\bm{y}}\neq{\bm{y}}^\prime} R^\gets_t({\bm{y}},{\bm{y}}^\prime).$

Figures (2)

  • Figure 1: Synthetic experiment results on sampling efficiency. We compare our proposed Masked Discrete Diffusion (MASK) against the Uniform baseline with vocabulary size $K=3$ and sequence length $d=4$. Left: The Total Variation (TV) distance between the empirical and ground truth distributions as a function of the Number of (Score) Function Evaluations (NFE). The solid lines represent the mean over 5 seeds, and shaded regions indicate the standard deviations. Our method achieves faster convergence to the target distribution. Right: Violin plots illustrating the distribution of Stopping NFE. The MASK method requires significantly fewer evaluations to terminate compared to the Uniform baseline.
  • Figure 2: Visualization of individual sampling trajectories. The plots show single sampling paths, with labels indicating the intermediate discrete states. The MASK method (top) navigates the state space efficiently with few steps. In contrast, the Uniform baseline (bottom) exhibits diffusive behavior with many small steps—often reverting previous changes—resulting in a high NFE cost.
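
The TV distance reported in Figure 1 compares an empirical distribution of samples with a known target. A minimal sketch of that computation follows, assuming samples are represented as tuples of token indices and the target is a dictionary from state to probability. The function name `empirical_tv` and the uniform target used in the usage example are illustrative assumptions, not the paper's experimental setup.

```python
from collections import Counter
from itertools import product


def empirical_tv(samples, target):
    """Total variation distance between empirical and target distributions.

    TV = 0.5 * sum_x |p_hat(x) - p(x)|, summed over the union of supports.
    `samples` is a list of hashable states; `target` maps state -> probability.
    """
    n = len(samples)
    counts = Counter(samples)
    support = set(counts) | set(target)
    return 0.5 * sum(
        abs(counts.get(s, 0) / n - target.get(s, 0.0)) for s in support
    )


if __name__ == "__main__":
    K, d = 3, 4  # vocabulary size and sequence length from Figure 1
    states = list(product(range(K), repeat=d))  # all K**d = 81 states
    uniform = {s: 1.0 / len(states) for s in states}  # assumed toy target
    # Visiting every state exactly once reproduces the uniform target.
    print(empirical_tv(states, uniform))
    # Samples concentrated on a single state are far from uniform.
    print(empirical_tv([states[0]] * 100, uniform))
```

With only $K^d = 81$ states, the sum over the full state space is cheap, which is why exact TV distance is practical in a synthetic setting of this size.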

Theorems & Definitions (39)

  • Lemma 2.1
  • Lemma 3.1: Exponentially decreasing KL divergence between $q^\to_t$ and $\tilde{q}_t$
  • Lemma 4.1: Bound of the outgoing rate
  • Theorem 4.2: Combination of Theorem \ref{thm:convergence_unif_reverse} and Theorem \ref{thm:mask_unif_complexity}
  • Corollary 4.3
  • Theorem 5.1: Convergence of Alg. \ref{alg:dlm_imple}
  • proof
  • Lemma C.1
  • proof
  • Lemma C.2
  • ...and 29 more