In-Situ Tweedie Discrete Diffusion Models
Xiao Li, Jiaqi Zhang, Shuxiang Zhang, Tianshui Chen, Liang Lin, Guangrun Wang
TL;DR
This work introduces in-situ Tweedie Discrete Diffusion (TDD), a framework for diffusion in discrete one-hot spaces that keeps the denoising guarantee of Tweedie's formula while operating directly on categorical data. TDD corrupts one-hot vectors with Gaussian noise and trains the denoiser with a timestep-conditioned cross-entropy loss. At inference, it discretizes each prediction with $\arg\max$ and re-noises the result at a lower noise level, progressively refining the prediction. The approach supports both single-token tasks (e.g., image classification) and multi-token generation (e.g., image-conditioned text generation), with conditioning from feature extractors and classifier-free guidance (CFG). Extensive ablations show that the cross-entropy alignment, the timestep-conditioned coefficients, CFG, and the discrete sampling strategy are each needed, and experiments on ImageNet and COCO show strong performance with relatively few sampling steps. Overall, TDD provides a robust, efficient, and principled discrete-diffusion paradigm suited to symbolic domains and downstream vision-language applications.
Abstract
Diffusion models excel at generating continuous data such as images, but adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token-masking mechanisms. Both deviate from modeling the true discrete data distribution, a property that Tweedie's formula can theoretically guarantee. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD corrupts one-hot vectors with Gaussian noise and denoises them iteratively using a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, and extensive ablation studies confirm the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
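
The denoising loop described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes an additive corruption $x_t = x_0 + \sigma\epsilon$, a linearly decreasing noise schedule, and a start from pure Gaussian noise, none of which are specified in the text above. The denoiser is a hypothetical stand-in for the trained network: it returns the Bayes posterior logits $x_t/\sigma^2$ plus a made-up conditioning term `cond_logits` that plays the role of image features. The names `corrupt`, `make_toy_denoiser`, and `sample` are illustrative. One reason a cross-entropy target is a natural fit is that, for one-hot $x_0$, the posterior mean $\mathbb{E}[x_0 \mid x_t]$ that Tweedie's formula characterizes is exactly the vector of posterior class probabilities.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Map integer labels of shape (batch,) to one-hot rows (batch, num_classes)."""
    return np.eye(num_classes)[labels]


def corrupt(x0, sigma, rng):
    """Assumed corruption: add isotropic Gaussian noise to one-hot vectors."""
    return x0 + sigma * rng.standard_normal(x0.shape)


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def cross_entropy(logits, labels):
    """Mean cross-entropy between predicted logits and integer labels."""
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[np.arange(len(labels)), labels])


def make_toy_denoiser(cond_logits):
    """Hypothetical stand-in for the learned, timestep-conditioned network.

    x_t / sigma**2 is the exact class log-likelihood (up to a constant) for
    Gaussian-corrupted one-hot vectors; cond_logits is an invented
    conditioning signal, not something taken from the paper.
    """
    def denoiser(x_t, sigma):
        return x_t / sigma**2 + cond_logits
    return denoiser


def sample(denoiser, num_classes, sigmas, rng, batch=1):
    """Predict probabilities, argmax, convert to one-hot, re-noise at lower sigma."""
    # Assumption: the first iterate is pure noise (zero signal).
    x0_hat = np.zeros((batch, num_classes))
    pred = None
    for sigma in sigmas:
        x_t = corrupt(x0_hat, sigma, rng)
        probs = np.exp(log_softmax(denoiser(x_t, sigma)))  # class probabilities
        pred = probs.argmax(axis=-1)                       # discretize
        x0_hat = one_hot(pred, num_classes)                # back to one-hot space
    return pred


rng = np.random.default_rng(0)
num_classes = 10

# Invented conditioning that strongly favors class 3.
cond_logits = np.zeros(num_classes)
cond_logits[3] = 20.0
denoiser = make_toy_denoiser(cond_logits)

# Training-side quantity: cross-entropy at one noise level.
# Any timestep-dependent loss weighting is omitted here.
labels = np.array([3, 3, 3, 3])
x_t = corrupt(one_hot(labels, num_classes), 1.0, rng)
loss = cross_entropy(denoiser(x_t, 1.0), labels)

# Inference: eight denoising steps with decreasing noise (assumed schedule).
sigmas = np.linspace(2.0, 0.1, 8)
pred = sample(denoiser, num_classes, sigmas, rng, batch=4)
```

In this toy setting the conditioning term dominates at high noise, so the first argmax already picks the favored class and later, lower-noise steps reinforce it. With a real network, `denoiser` would be the trained model conditioned on image features and the noise level, and classifier-free guidance would combine conditional and unconditional logits before the argmax.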
