Bit-Level Discrete Diffusion with Markov Probabilistic Models: An Improved Framework with Sharp Convergence Bounds under Minimal Assumptions

Le-Tuyet-Nhi Pham, Dario Shariatian, Antonio Ocello, Giovanni Conforti, Alain Durmus

TL;DR

This work extends score-based generative modeling to discrete data by formulating a forward continuous-time Markov chain (CTMC) on the hypercube $\{0,1\}^d$ together with a tractable time-reversed process. A discrete score, defined as a conditional expectation, is learned via an $\mathrm{L}^2$ projection with a denoiser-based reparameterization, which gives stable training with a simple regression target. The authors prove non-asymptotic convergence bounds for Discrete Markov Probabilistic Models (DMPMs) under minimal assumptions, showing that the sampling error grows linearly in the dimension, and provide an early-stopping refinement that tightens the bound. They also report competitive performance on low- and high-dimensional discrete data, including binarized MNIST, with efficient sampling. Together, the principled time-reversal derivation, the practical training objective, and the convergence guarantees yield a scalable, theoretically grounded framework for discrete generative modeling.

Abstract

This paper introduces Discrete Markov Probabilistic Models (DMPMs), a novel discrete diffusion algorithm for discrete data generation. The algorithm operates in discrete bit space, where the noising process is a continuous-time Markov chain that flips labels uniformly at random. The time-reversal process, like the forward noise process, is a jump process with its intensity governed by a discrete analogue of the classical score function. Crucially, this intensity is proven to be the conditional expectation of a function of the forward process, underlining theoretical alignment with score-based generative models. We establish convergence bounds for the algorithm under minimal assumptions, ensuring robustness and efficiency, which we demonstrate through experiments on low-dimensional Bernoulli-distributed datasets and high-dimensional binary MNIST data. The results highlight competitive performance in generating discrete structures compared to the state-of-the-art. This work bridges theoretical foundations and practical applications, advancing the development of effective and theoretically grounded discrete generative modeling.
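As a rough illustration of the noising process described in the abstract, the following sketch simulates a bit-flip CTMC on $\{0,1\}^d$ using only the Python standard library. It assumes a constant per-coordinate flip rate `rate` (so the chain jumps at total rate `d * rate` and flips a uniformly chosen coordinate at each jump); the paper's actual rate normalization and time schedule may differ. The names `forward_flip` and `flip_probability` are hypothetical and introduced only for this example.

```python
import math
import random


def forward_flip(x0, t, rate=1.0, rng=None):
    """Sample X_t given X_0 = x0 for a uniform bit-flip CTMC on {0,1}^d.

    Assumption (not taken from the paper): jumps occur at total rate
    d * rate, and each jump flips one coordinate chosen uniformly at random.
    """
    if rng is None:
        rng = random.Random()
    x = list(x0)
    d = len(x)
    # Waiting times between jumps are exponential with rate d * rate.
    clock = rng.expovariate(d * rate)
    while clock <= t:
        i = rng.randrange(d)  # coordinate to flip, uniform over {0, ..., d-1}
        x[i] ^= 1             # flip the selected bit
        clock += rng.expovariate(d * rate)
    return x


def flip_probability(t, rate=1.0):
    """Probability that a given bit at time t differs from its initial value.

    Each bit flips at rate `rate`, so it differs from its start exactly when
    it has flipped an odd number of times: (1 - exp(-2 * rate * t)) / 2.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * rate * t))


if __name__ == "__main__":
    rng = random.Random(0)
    d, n, t = 8, 2000, 0.5
    flipped = sum(sum(forward_flip([0] * d, t, rng=rng)) for _ in range(n))
    # The empirical fraction of flipped bits should be close to the formula.
    print(flipped / (n * d), flip_probability(t))
```

As `t` grows, `flip_probability` tends to 1/2, so under this assumed parameterization the chain forgets its starting point and approaches the uniform distribution on the hypercube, which is what makes it usable as a noising process.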

Paper Structure

This paper contains 51 sections, 24 theorems, 311 equations, 8 figures, 3 tables, 5 algorithms.

Key Result

Proposition 1.1

The score function can be expressed as a conditional expectation, for $t \in [0,T_f)$, $x\in \mathsf{X}$, and $\ell = 1,\ldots,d$, where $s_t^{\ell}$ denotes the $\ell$-th component of the score function $s_t$.
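Writing the score as a conditional expectation is what makes a regression objective natural, through the standard $\mathrm{L}^2$-projection property of conditional expectations. The sketch below states that property with a placeholder integrand $\varphi_t^{\ell}$, a hypothetical symbol (not the paper's notation) standing for the function of the forward process that appears in the proposition. Writing the forward process as $\overrightarrow{X}$ and letting $\varphi_t^{\ell}$ depend on $(\overrightarrow{X}_0, \overrightarrow{X}_t)$ are likewise assumptions made for illustration.

```latex
s_t^{\ell}(x)
  = \mathbb{E}\big[\varphi_t^{\ell}(\overrightarrow{X}_0, \overrightarrow{X}_t)
      \,\big|\, \overrightarrow{X}_t = x\big]
\quad\Longrightarrow\quad
s_t^{\ell} \in \operatorname*{arg\,min}_{g}\;
  \mathbb{E}\Big[\big(g(\overrightarrow{X}_t)
      - \varphi_t^{\ell}(\overrightarrow{X}_0, \overrightarrow{X}_t)\big)^2\Big]
```

Here the minimum is taken over functions $g$ on $\mathsf{X}$, and the minimizer is unique on states visited with positive probability at time $t$. In practice $g$ would be replaced by a parametric model fitted on samples of the forward process.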

Figures (8)

  • Figure 1: Comparison of time-schedules (cosine, linear, quadratic) and time horizon ($T_f=3$ vs. $T_f=10$).
  • Figure 2: $\text{SWD} \downarrow$, in units of $10^{-3}$, for DMPM, MD4, and DFM across data dimension $d$. For each method, the best result over the number of steps $2\leqslant K \leqslant 200$ is reported.
  • Figure 3: FID$\downarrow$ on MNIST, linear vs. constant flip-schedules scaled for $d$ total bit flips, with various loss configurations.
  • Figure 4: Comparison of $\mathfrak{L}_{\mathrm{L}^2}, \mathfrak{L}_{\text{CE}}, \mathfrak{L}_{\mathrm{L}^2}^{w}, \mathfrak{L}_{\text{CE}}^w$ average losses over timesteps. The two losses become scaled versions of one another only when averaged over data, but otherwise benefit from positive synergies when mixed together.
  • Figure 5: FID$\downarrow$, on MNIST, for models trained with $\mathfrak{L}_{\varpi}$ and $\mathfrak{L}_{\varpi}^{w}$ losses, evaluated using $200$ reverse steps with the denoise-renoise sampler. Scaling with $w$ yields consistent improvements, with the best loss configuration $\mathfrak{L}_{1/3, 1/3, 1/3}^w$ involving all the methodological improvements we discussed.
  • ...and 3 more figures

Theorems & Definitions (54)

  • Proposition 1.1
  • Theorem 2.3
  • Theorem 2.4
  • Corollary 2.5
  • Theorem 2.6
  • Proposition 2.7
  • Corollary 2.8
  • proof : Detailed calculation of the transition probability in \ref{eq:def_density_one_dim_transition}
  • proof : Proof of \ref{eq:transition_d}
  • proof : Proof of \ref{prop:1}
  • ...and 44 more