Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

Tianyu Xie, Shuchen Xue, Zijin Feng, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Cheng Zhang

TL;DR

This work addresses the challenge of modeling complex inter-dimensional dependencies in discrete diffusion models by introducing Variational Autoencoding Discrete Diffusion (VADD). VADD integrates a latent variable $\bm{z}$ into the denoising distribution and optimizes a Double ELBO (DELBO) via a recognition model, enabling implicit capture of correlations across dimensions while preserving fast, parallel sampling. A consistency sampler (VADD-CS) stabilizes generation by fixing $\bm{z}$ across backward steps. Across 2D toy, image, and text tasks, VADD consistently improves sample quality over Masked Diffusion Models (MDMs), especially with few denoising steps, and achieves strong perplexities and image metrics on challenging datasets.

Abstract

Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training by maximizing variational lower bounds, with amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines.
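
The limitation described above can be illustrated with a minimal toy sketch. It is not the paper's model: the two-dimensional target, the binary latent `z`, and all function names are assumptions chosen for illustration. The target puts all its mass on the two correlated outcomes `00` and `11`. A denoiser that unmasks both dimensions in one step from a factorized distribution can only match the per-dimension marginals, so it also produces `01` and `10`. A denoiser that is factorized only conditionally on a latent variable can represent the target exactly.

```python
from itertools import product

# Toy target over two binary dimensions (an assumption for illustration):
# only "00" and "11" occur, each with probability 1/2.
target = {(0, 0): 0.5, (1, 1): 0.5}

def marginals(dist, n_dims=2):
    """Per-dimension probability of the value 1 under a joint distribution."""
    return [sum(p for x, p in dist.items() if x[i] == 1) for i in range(n_dims)]

def factorized(p_one):
    """Joint distribution in which every dimension is drawn independently."""
    joint = {}
    for x in product((0, 1), repeat=len(p_one)):
        prob = 1.0
        for xi, pi in zip(x, p_one):
            prob *= pi if xi == 1 else 1.0 - pi
        joint[x] = prob
    return joint

# (a) One-step unmasking with a fully factorized denoiser:
#     the best it can do is the product of the target marginals.
one_step = factorized(marginals(target))

# (b) Hypothetical latent-variable denoiser: z selects a mode, and the
#     dimensions are independent only after conditioning on z.
prior_z = {0: 0.5, 1: 0.5}
p_one_given_z = {0: [0.0, 0.0], 1: [1.0, 1.0]}
with_latent = {x: 0.0 for x in product((0, 1), repeat=2)}
for z, pz in prior_z.items():
    for x, p in factorized(p_one_given_z[z]).items():
        with_latent[x] += pz * p

print("factorized :", one_step)
print("with latent:", with_latent)
```

In this sketch the factorized one-step sampler assigns probability 0.25 to each of the four outcomes, so half of its samples fall outside the support of the target, while the latent-variable mixture recovers the target distribution. Both variants still sample all dimensions in parallel; the only difference is the extra draw of `z`.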

Paper Structure

This paper contains 33 sections, 2 theorems, 28 equations, 15 figures, 7 tables, 2 algorithms.

Key Result

Proposition 1

For all $\bm{x}_0$, it holds that $\mathcal{L}^{\mathrm{CS}}(\bm{x}_0;\bm{\theta})\leq \mathcal{L}(\bm{x}_0;\bm{\theta})$.

Figures (15)

  • Figure 1: One-step samples of VADD and MDLM (Sahoo et al., 2024) on 2D examples.
  • Figure 2: The network architecture of the denoising model and recognition model in VADD for text modeling. The feature dimensions of the tensors are marked in red font. @$\bm{M}$ means that the module is only applied to the positions $i$ satisfying $\bm{M}^i=1$.
  • Figure 3: Non-cherry-picked samples generated by different discrete diffusion models and sampling steps on the binarized MNIST dataset.
  • Figure 4: Generative perplexities ($\downarrow$) evaluated by a pre-trained GPT-2 large model based on 256 samples on OpenWebText. All model sizes correspond to GPT-2 small.
  • Figure 5: Histograms of the ground truth and the samples generated from different models and sampling steps on the 2D toy example.
  • ...and 10 more figures

Theorems & Definitions (5)

  • Remark
  • Proposition 1
  • proof
  • Proposition 2
  • proof