Generalized Discrete Diffusion from Snapshots

Oussama Zekri, Théo Uscidda, Nicolas Boullé, Anna Korba

Abstract

We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization, which enables fast corruption under arbitrary dynamics. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents rather than the entire noising path, which allows efficient training of standard generative modeling architectures with a clear probabilistic interpretation. Our experiments on large-vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page: https://oussamazekri.fr/gdds.

Paper Structure

This paper contains 83 sections, 130 equations, 6 figures, 13 tables, and 5 algorithms.

Figures (6)

  • Figure 1: Zero-shot transfer of OWT-trained models. Zero-shot perplexity ($\downarrow$) on three representative downstream validation sets from \ref{tbl:owt_zero_shot}: PTB, LM1B, and Wikitext. Across this high-to-low perplexity range, GDDS Gauss consistently achieves the lowest transfer perplexity, highlighting the stronger generalization capability induced by semantically structured noising processes.
  • Figure 2: Overview of GDDS. A clean sequence $\mathbf{x}_0$ is first noised exactly by the forward CTMC at a sampled time $t\in[0,1]$, yielding a snapshot sequence $(\mathbf{x}_t,t)$. The mean parametrization is then used as a denoiser: given the snapshot, the model predicts the clean-token posterior directly from $(\mathbf{x}_t,t)$, so training is performed on snapshots rather than through a full path-wise objective.
  • Figure 3: Snapshot vs. path-wise training. The forward process corrupts the clean sequence "My name is David". The blue path shows the beginning of the noising trajectory $\omega=\{(x_{t_k}^{(4)},t_k)\}_{k\ge1}$ of one tracked position ($\ell=4$). Path-wise objectives condition on the entire trajectory $\omega$, whereas our GDDS snapshot objective uses only one random-time observation $s=(x_{t^\star},t^\star)$.
  • Figure 4: OWT training curves. Evolution of OWT validation perplexity during training for the retrained models reported in \ref{tbl:owt_ppl}. This complements the final numbers in \ref{tbl:owt_ppl} by showing the full optimization trajectory; both axes are shown on logarithmic scales.
  • Figure 5: Generation quality-diversity tradeoff. Gen-PPL ($\downarrow$) vs. entropy. For $K\in\{32,64,128,256,512,1024\}$ decoding steps, we plot the generative perplexity of $N_{\text{gen}}=256$ unconditional samples under a fixed evaluator (GPT2-large) against their sequence entropy (higher is better). Bubble radius increases with $K$. For reference, the AR baseline achieves Gen-PPL $56.82$ at entropy $5.60$.
  • ...and 1 more figure
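
The snapshot construction described in the abstract and in the Figure 2 caption (noise a clean sequence at a sampled time $t\in[0,1]$ using uniformization) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy rate matrix `Q`, the uniform draw of $t$, and the choice to noise each token independently are all assumptions made for illustration. It relies only on the standard uniformization identity $e^{tQ}=\sum_{n\ge0}\mathrm{Poisson}(n;\lambda t)\,P^n$ with $P=I+Q/\lambda$ and $\lambda\ge\max_i|Q_{ii}|$.

```python
import numpy as np

def uniformized_kernel(Q):
    """Return (lam, P), where P = I + Q / lam is a row-stochastic matrix.

    Q is assumed to be a valid CTMC rate matrix: non-negative off-diagonal
    entries and rows summing to zero.
    """
    lam = float(np.max(-np.diag(Q)))  # largest exit rate
    if lam <= 0.0:
        raise ValueError("Q must have at least one positive exit rate")
    P = np.eye(Q.shape[0]) + Q / lam
    return lam, P

def sample_snapshot(x0, Q, t, rng):
    """Draw x_t ~ q_t(. | x0) exactly: Poisson number of jumps, then P-steps."""
    lam, P = uniformized_kernel(Q)
    n_jumps = rng.poisson(lam * t)  # number of uniformized jump events in [0, t]
    x = int(x0)
    for _ in range(n_jumps):
        x = int(rng.choice(P.shape[0], p=P[x]))
    return x

def make_snapshot(tokens, Q, rng):
    """Hypothetical helper: sample t ~ U[0, 1] and noise each token independently."""
    t = float(rng.uniform(0.0, 1.0))
    x_t = [sample_snapshot(x, Q, t, rng) for x in tokens]
    return x_t, t

# Toy example: uniform corruption over a 4-symbol vocabulary (illustrative only).
Q_toy = np.full((4, 4), 0.25) - np.eye(4)
rng_toy = np.random.default_rng(0)
x_t_toy, t_toy = make_snapshot([0, 1, 2, 3], Q_toy, rng_toy)
```

Only the pair `(x_t, t)` is returned, mirroring the snapshot view in which training never needs the intermediate jumps of the trajectory.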

Theorems & Definitions (13)

  • proof: Proof of \ref{prop:rate_matrix}
  • proof
  • proof: Proof of \ref{prop:mixing_matrix}
  • Remark 2.1
  • proof: Proof of \ref{prop:mixing_matrix_uniformization}
  • Remark 2.2: Exact sampling of $x_t\sim q_t(\cdot\mid x_0)$
  • proof
  • proof
  • proof
  • proof
  • ...and 3 more