Think While You Generate: Discrete Diffusion with Planned Denoising

Sulin Liu; Juno Nam; Andrew Campbell; Hannes Stärk; Yilun Xu; Tommi Jaakkola; Rafael Gómez-Bombarelli

TL;DR

This work addresses the limitations of denoiser-only discrete diffusion by introducing Discrete Diffusion with Planned Denoising (DDPD), a two-model framework consisting of a planner and a denoiser. The planner selects which token positions to denoise next, and the denoiser predicts their values, enabling an adaptive, order-aware sampling process implemented via a Gillespie-based sampler with time correction. The authors provide ELBO-based training objectives that allow independent optimization of the planner and denoiser, and demonstrate strong gains on GPT-2-scale language modeling (text8 and OpenWebText) and token-based ImageNet generation, outperforming conventional diffusion baselines. The approach reduces the gap to autoregressive methods in language modeling, improves sample quality and robustness, and offers practical benefits for reuse of pretrained denoisers and efficient inference. Overall, DDPD presents a scalable, principled path to more capable discrete generative models through strategic planning of denoising steps.

Abstract

Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduce Discrete Diffusion with Planned Denoising (DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted positions and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on the language modeling benchmarks text8 and OpenWebText and on token-based image generation on ImageNet $256 \times 256$. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. Code is available at https://github.com/liusulin/DDPD.
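The plan-and-denoise loop described in the abstract can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the authors' implementation: `toy_planner`, `toy_denoiser`, `plan_and_denoise`, the `threshold` stopping rule, and the fixed `TARGET` string are all hypothetical stand-ins for trained networks, and the time-correction step of the actual sampler is omitted.

```python
import random

TARGET = "hello"  # toy ground truth, known only to the stand-in models below


def toy_planner(tokens):
    """Hypothetical planner: per-position probability that the token is corrupted."""
    return [0.05 if tok == ref else 0.95 for tok, ref in zip(tokens, TARGET)]


def toy_denoiser(tokens, pos):
    """Hypothetical denoiser: predicts a clean token for the selected position."""
    return TARGET[pos]


def plan_and_denoise(tokens, planner, denoiser, max_steps=100, threshold=0.5, rng=None):
    """Repeatedly pick a position using the planner, then overwrite it with the denoiser."""
    rng = rng or random.Random(0)  # fixed seed so the toy run is reproducible
    tokens = list(tokens)
    for step in range(max_steps):
        probs = planner(tokens)
        # Assumed stopping rule: stop once no position looks corrupted.
        if max(probs) < threshold:
            return tokens, step
        # Sample a position with probability proportional to its corruption score.
        pos = rng.choices(range(len(tokens)), weights=probs, k=1)[0]
        tokens[pos] = denoiser(tokens, pos)
    return tokens, max_steps


if __name__ == "__main__":
    result, steps = plan_and_denoise("xqzjk", toy_planner, toy_denoiser)
    print("".join(result), "after", steps, "steps")
```

Because positions are sampled rather than chosen greedily, a position that already looks clean can occasionally be revisited, which loosely mirrors the refinement of previously denoised tokens mentioned above.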

Paper Structure (46 sections, 4 theorems, 60 equations, 20 figures, 12 tables, 1 algorithm)


Introduction
Preliminaries
Method
Decomposing Generation into Planning and Denoising
Sampling
Training
Related work
Experiment
Conclusion
Reproducibility statement
Ethics Statement
Proofs
Proof of the generative rate decomposition proposition
Proof of the training objective theorem: Deriving the ELBO for training
Proof of the self-loop jumps proposition: Continuous Time Markov Chains with Self-Connections
...and 31 more sections

Key Result

Proposition 3.1

The reverse generative rate at the $d$-th dimension can be decomposed into the product of three factors: a recovery rate, the probability that the position is corrupted, and the probability of the denoised value.
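The product structure can be written schematically as follows. The symbols here are placeholders assumed for illustration and are not the paper's notation: $R_t^{d}(x_t, j)$ stands for the rate of changing position $d$ of the current sequence $x_t$ to token $j$, and $\lambda_t$ stands for the time-dependent recovery rate.

$$
R_t^{d}(x_t, j) \;=\; \underbrace{\lambda_t}_{\text{recovery rate}} \;\cdot\; \underbrace{p(\text{position } d \text{ is corrupted} \mid x_t)}_{\text{probability of corruption}} \;\cdot\; \underbrace{p(\text{clean token at } d \text{ is } j \mid x_t,\ \text{position } d \text{ is corrupted})}_{\text{probability of denoising}}
$$

Read this way, the second factor is the quantity the planner estimates and the third is the prediction made by the denoiser.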

Figures (20)

Figure 1: An example generation trajectory from $t=0$ to $1$ of a $5$-letter sequence. At each step, the planner estimates the probability of token corruption (as indicated by the numbers), selects a position, and the denoiser predicts the token. The actual time progression, which is adaptively re-calibrated by the planner's noise level assessment, may deviate from the scheduled timestep. For instance, in step $2$, minimal improvement results in slower time progression, while in step $4$, an error causes a backward step in time. Sampling continues until all corrupted tokens are reconstructed.
Figure 2: Negative log-likelihood measured with GPT-J versus sample entropy (in terms of tokens), with logit temperatures of the denoiser swept over $\{0.8, 0.9, 1.0\}$. Left: DDPD vs. SOTA baselines. Middle: Varying sampling steps from $256$ to $1024$; both DFM and DDPD use the same mask-based denoiser. Right: DDPD single-neural-network vs. DDPD planner + mask denoiser, both trained for $20k$ iterations; DFM is shown at $750k$ iterations.
Figure 3: Generative perplexity $\downarrow$ vs. entropy $\uparrow$ (both plotted in log-scale) of SEDD, DDPD and GPT-2.
Figure 4: Gillespie Algorithm sampling loop with uniform diffusion denoiser (DFM-Uni) decomposed as a planner and a denoiser.
Figure 5: DDPD sampling loop with a separate planner and a denoiser.
...and 15 more figures

Theorems & Definitions (6)

Proposition 3.1
Remark 3.2
Proposition 3.3
Remark 3.4
Proposition 3.5
Theorem 4.1
