In-Situ Tweedie Discrete Diffusion Models

Xiao Li, Jiaqi Zhang, Shuxiang Zhang, Tianshui Chen, Liang Lin, Guangrun Wang

TL;DR

This work introduces in-situ Tweedie Discrete Diffusion (TDD), a principled framework for diffusion in discrete one-hot spaces that preserves Tweedie’s diffusion guarantees while operating directly on categorical data. TDD corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy loss, using $\arg\max$ discretization at inference and re-noising to progressively refine predictions. The approach supports both single-token tasks (e.g., image classification) and multi-token generation (e.g., image-conditioned text generation) with conditioning from feature extractors and a classifier-free guidance mechanism. Extensive ablations demonstrate the necessity of the cross-entropy alignment, timestep-conditioned coefficients, CFG, and the discrete sampling strategy, and experiments on ImageNet and COCO show strong performance with relatively few sampling steps. Overall, TDD provides a robust, efficient, and principled discrete-diffusion paradigm suitable for symbolic domains and downstream vision-language applications.
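The training recipe summarized above (Gaussian corruption of one-hot vectors, then cross-entropy supervision) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear interpolation schedule `(1 - t) * x0 + t * eps`, the helper names `one_hot`, `corrupt`, and `cross_entropy`, and the use of the noisy vector itself as stand-in logits are all assumptions made for the example. The paper's timestep-conditioned coefficients are not reproduced here.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer labels of shape (B,) to one-hot rows of shape (B, K)."""
    return np.eye(num_classes)[labels]

def corrupt(x0, t, rng):
    """Add Gaussian noise to one-hot vectors.

    Assumed schedule (hypothetical): x_t = (1 - t) * x0 + t * eps, t in [0, 1].
    """
    eps = rng.standard_normal(x0.shape)
    return (1.0 - t) * x0 + t * eps

def cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
labels = np.array([2, 0, 1])
x0 = one_hot(labels, num_classes=4)   # clean one-hot targets
xt = corrupt(x0, t=0.5, rng=rng)      # noisy input the network would see

# Stand-in for a network: treat the noisy vector itself as logits.
# A real model would also take the timestep and the conditioning features.
loss = cross_entropy(xt, labels)
print(f"cross-entropy on noisy one-hot input: {loss:.4f}")
```

The point of the sketch is that the target stays a class index while the input lives in a noisy continuous relaxation of the one-hot space, so the loss is a classification loss rather than an MSE reconstruction loss.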

Abstract

While diffusion models excel at generating continuous data such as images, adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token masking mechanisms, both of which deviate from modeling the true discrete data distribution that can be theoretically guaranteed by Tweedie's formula. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD directly corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process naturally unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, with extensive ablation studies confirming the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
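The sampling procedure described in the abstract (predict class scores, take the argmax, convert to one-hot, re-noise at a lower level, repeat) can be sketched as below. This is a hedged illustration under stated assumptions: the linear noise schedule, the function name `sample`, and the `toy_denoiser` (which simply adds a strong bias toward a known target class, standing in for an image-conditioned network) are hypothetical and not taken from the paper.

```python
import numpy as np

def sample(denoiser, num_classes, num_steps, rng, batch=1):
    """Iterative argmax-and-renoise sampling in one-hot space.

    Assumed schedule (hypothetical): x_t = (1 - t) * x0_hat + t * eps,
    with t decreasing linearly from 1 (pure noise) to 0 (clean one-hot).
    """
    x = rng.standard_normal((batch, num_classes))   # start from Gaussian noise
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        logits = denoiser(x, t_cur)                 # per-class scores
        pred = logits.argmax(axis=-1)               # discretize
        x0_hat = np.eye(num_classes)[pred]          # back to one-hot
        eps = rng.standard_normal(x0_hat.shape)
        x = (1.0 - t_next) * x0_hat + t_next * eps  # re-noise with less noise
    return x.argmax(axis=-1)                        # final discrete prediction

# Toy stand-in for a conditioned network: the "conditioning" is a strong
# bias toward a known target class for each batch element.
target = np.array([2, 0, 4])
cond = np.eye(5)[target]

def toy_denoiser(x, t):
    return x + 10.0 * cond

result = sample(toy_denoiser, num_classes=5, num_steps=10,
                rng=np.random.default_rng(0), batch=3)
print(result)
```

At the last step the noise level is zero, so the state is exactly the one-hot prediction and the final argmax simply reads it off. For multi-token generation, the same loop would apply to an array with an extra sequence axis, with all tokens refined in parallel.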

Paper Structure

This paper contains 42 sections, 15 equations, 10 figures, 3 tables, 2 algorithms.

Figures (10)

  • Figure 1: Comparison between (a) traditional continuous-space diffusion, (b) Mask-based Discrete Diffusion (MDD), and (c) our proposed in-situ Tweedie Discrete Diffusion (TDD) framework. In (c), the left panel illustrates single-token discrete generation (e.g., classification), while the right panel illustrates multi-token discrete generation (e.g., text generation). Continuous diffusion models operate in Gaussian space, performing noise prediction and MSE-based reconstruction. Mask-based discrete models mimic diffusion through masked token recovery. In contrast, TDD begins from Gaussian-corrupted one-hot vectors and performs denoising directly in the one-hot space. At each step, the model applies an $\arg\max$ to produce discrete predictions, converts them into one-hot vectors, and feeds them into the next iteration after adding noise with a reduced coefficient. This refinement process yields stable and efficient discrete-space diffusion, achieving accurate categorical predictions in only a few steps.
  • Figure 2: Overview of the proposed framework. (a) Single-token generation for classification. Ground-truth labels are converted into one-hot vectors and perturbed with Gaussian noise during training. The diffusion model iteratively denoises these corrupted vectors back to categorical one-hot outputs through the "to one" operation, which is implemented as softmax with timestep-conditioned cross-entropy supervision in training and as $\arg\max$ with one-hot vectorization during sampling. Image features provide conditioning signals that guide the denoising process. (b) Conditioning module. The feature extractor encodes the input image into tokens, and a set of learnable class tokens interact with the image tokens through stacked Transformer layers to yield the conditioning representation $c$, which is injected into the diffusion blocks. (c) Multi-token generation for text. TDD extends naturally from single-label classification to sequence generation by applying Gaussian corruption and denoising to each token in a sequence of one-hot vectors. The diffusion blocks predict categorical distributions for all tokens in parallel, enabling efficient iterative refinement of entire sequences into coherent text.
  • Figure 3: Comparison of TDD and MDD accuracy across epochs. We trained TDD and the state-of-the-art method on ImageNet for 300 epochs, tracking accuracy at each epoch. The state-of-the-art method began to overfit after 250 epochs, whereas our method was still improving at 300 epochs. This justifies extending the training to 500 epochs, ultimately achieving optimal performance (Top-1 score = 82.8).
  • Figure 4: Performance Comparison Between MSE Loss and Cross-Entropy Loss. We compared the outputs generated by models trained with the two loss functions. Models trained with cross-entropy loss achieved high accuracy, while those trained with MSE loss showed unsatisfactory performance.
  • Figure 5: Performance of TDD at different numbers of sampling steps. We ran experiments with and without timestep-conditioned coefficients and measured the classification accuracy of TDD at each step count. TDD reaches strong performance in about 10 sampling steps, and the best result is achieved after 20 iterations. However, without the timestep-conditioned coefficients, the iterative process exhibits slightly higher accuracy
  • ...and 5 more figures