Think While You Generate: Discrete Diffusion with Planned Denoising
Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, Rafael Gómez-Bombarelli
TL;DR
This work addresses the limitations of denoiser-only discrete diffusion by introducing Discrete Diffusion with Planned Denoising (DDPD), a two-model framework consisting of a planner and a denoiser. The planner selects which token positions to denoise next, and the denoiser predicts their values, enabling an adaptive, order-aware sampling process implemented via a Gillespie-based sampler with time correction. The authors provide ELBO-based training objectives that allow independent optimization of the planner and denoiser, and demonstrate strong gains on GPT-2-scale language modeling (text8 and OpenWebText) and token-based ImageNet generation, outperforming conventional diffusion baselines. The approach reduces the gap to autoregressive methods in language modeling, improves sample quality and robustness, and offers practical benefits for reuse of pretrained denoisers and efficient inference. Overall, DDPD presents a scalable, principled path to more capable discrete generative models through strategic planning of denoising steps.
Abstract
Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduce Discrete Diffusion with Planned Denoising (DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted positions and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on the language modeling benchmarks text8 and OpenWebText, as well as on token-based image generation on ImageNet $256 \times 256$. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. Code is available at https://github.com/liusulin/DDPD.
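The plan-and-denoise loop described above can be illustrated with a minimal toy sketch. This is not the released DDPD implementation: the names `toy_planner`, `toy_denoiser`, and `plan_and_denoise` are hypothetical, both "models" are hand-written rules standing in for learned networks, and the greedy selection rule is a simplification that stands in for the Gillespie-based sampler with time correction used in the paper. The toy task assumes that a clean sequence satisfies `tokens[i] == i % VOCAB_SIZE`.

```python
import random

VOCAB_SIZE = 5  # toy vocabulary {0, ..., 4} (assumption for this sketch)


def toy_planner(tokens):
    """Stand-in for a learned planner: per-position probability of corruption.

    In this toy task a clean sequence satisfies tokens[i] == i % VOCAB_SIZE,
    so corruption can be detected exactly with a hand-written rule.
    """
    return [0.0 if tok == i % VOCAB_SIZE else 1.0 for i, tok in enumerate(tokens)]


def toy_denoiser(tokens, position):
    """Stand-in for a learned denoiser: predicted clean token at `position`."""
    return position % VOCAB_SIZE


def plan_and_denoise(tokens, max_steps=100, threshold=0.5):
    """Greedy loop: each step, denoise the position the planner trusts least."""
    tokens = list(tokens)  # work on a copy; do not mutate the caller's list
    steps = 0
    if not tokens:
        return tokens, steps
    for _ in range(max_steps):
        noise_probs = toy_planner(tokens)
        # Plan: pick the position with the highest corruption probability.
        position = max(range(len(tokens)), key=noise_probs.__getitem__)
        if noise_probs[position] < threshold:
            break  # the planner considers every position clean
        # Denoise: overwrite only the selected position.
        tokens[position] = toy_denoiser(tokens, position)
        steps += 1
    return tokens, steps


rng = random.Random(0)
noisy = [rng.randrange(VOCAB_SIZE) for _ in range(8)]  # start from uniform noise
clean, steps_used = plan_and_denoise(noisy)
print(clean, steps_used)
```

In this sketch the number of denoising steps equals the number of positions the planner flags, so positions that already look clean are never revisited. That is the intuition behind letting a planner, rather than a fixed schedule, decide where the denoiser is applied next.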
