Why Gaussian Diffusion Models Fail on Discrete Data?

Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, Dmitry Vetrov

Abstract

Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes, producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
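The sampling recipe in the abstract is concrete enough to sketch. Below is a minimal, hypothetical implementation (not the authors' code) of a self-conditioned sampler that takes DDPM posterior steps by default and switches to q-sampling, which we read as re-noising the current prediction $\hat{\mathbf{x}}_\theta(\mathbf{x}_t)$ through the forward kernel, inside the critical interval. The names `model`, `alpha_bar`, `t_lo`, `t_hi`, and the model signature are assumptions.

```python
import torch

def ddpm_step(x_t, x0_hat, alpha_bar, t):
    """One DDPM posterior step: x_{t-1} ~ q(x_{t-1} | x_t, x_0 = x0_hat)."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
    alpha_t = ab_t / ab_prev
    beta_t = 1.0 - alpha_t
    # Posterior mean of q(x_{t-1} | x_t, x_0), with x_0 replaced by the prediction.
    mean = (ab_prev.sqrt() * beta_t * x0_hat
            + alpha_t.sqrt() * (1.0 - ab_prev) * x_t) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t   # \tilde{\beta}_t
    return mean + var.sqrt() * torch.randn_like(x_t)

def q_sampling_step(x0_hat, alpha_bar, t):
    """q-sampling: re-noise the prediction, x_{t-1} ~ q(x_{t-1} | x_0 = x0_hat)."""
    ab_prev = alpha_bar[t - 1]
    return ab_prev.sqrt() * x0_hat + (1.0 - ab_prev).sqrt() * torch.randn_like(x0_hat)

@torch.no_grad()
def sample(model, shape, alpha_bar, t_lo, t_hi):
    """Hybrid sampling: DDPM everywhere, q-sampling inside [t_lo, t_hi]."""
    T = len(alpha_bar) - 1
    x_t = torch.randn(shape)
    x0_prev = torch.zeros(shape)            # self-conditioning input, starts at zero
    for t in range(T, 0, -1):
        x0_hat = model(x_t, t, x0_prev)     # model also conditions on its last prediction
        if t_lo <= t <= t_hi:               # critical (multimodal) interval
            x_t = q_sampling_step(x0_hat, alpha_bar, t)
        else:
            x_t = ddpm_step(x_t, x0_hat, alpha_bar, t)
        x0_prev = x0_hat
    return x0_prev
```

Here `alpha_bar` is the cumulative schedule $\bar{\alpha}_t$ as a 1-D tensor with `alpha_bar[0] = 1`, so the final step returns the prediction exactly; `t_lo` and `t_hi` would be chosen to cover the transition interval identified on the toy model.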

Paper Structure

This paper contains 61 sections, 1 theorem, 25 equations, 16 figures, and 10 tables.

Key Result

Theorem 1

Let $\mathbf{x}_t \sim p(\mathbf{x}_t)$ and define $\mathbf{V}_t := \mathbb{E}[\mathrm{Var}[\mathbf{x}_0 \mid \mathbf{x}_t]]$, the average posterior uncertainty. Then the one-step covariances of the DDPM and q-sampling updates admit a semidefinite ordering, and, moreover, both underestimate the true forward marginal variance $\mathrm{Var}[\mathbf{x}_{t-1}]$. (We operate with covariance matrices, so $\mathbf{A} \ge \mathbf{B}$ denotes that $\mathbf{A} - \mathbf{B}$ is positive semidefinite.)
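As a sanity check on the underestimation claim, here is a short derivation for the q-sampling update, under the assumptions (not stated in this excerpt) of a standard VP forward process $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ and the Bayes-optimal predictor $\hat{\mathbf{x}}^*(\mathbf{x}_t) = \mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$. The law of total variance gives $\mathrm{Var}[\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]] = \mathrm{Var}[\mathbf{x}_0] - \mathbf{V}_t$, so

\begin{align*}
\mathbf{x}_{t-1}^{\,q} &= \sqrt{\bar{\alpha}_{t-1}}\,\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t] + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \\
\mathrm{Var}[\mathbf{x}_{t-1}^{\,q}] &= \bar{\alpha}_{t-1}\left(\mathrm{Var}[\mathbf{x}_0] - \mathbf{V}_t\right) + (1-\bar{\alpha}_{t-1})\,\mathbf{I} = \mathrm{Var}[\mathbf{x}_{t-1}] - \bar{\alpha}_{t-1}\mathbf{V}_t \le \mathrm{Var}[\mathbf{x}_{t-1}],
\end{align*}

since the true forward marginal is $\mathrm{Var}[\mathbf{x}_{t-1}] = \bar{\alpha}_{t-1}\mathrm{Var}[\mathbf{x}_0] + (1-\bar{\alpha}_{t-1})\,\mathbf{I}$. The deficit $\bar{\alpha}_{t-1}\mathbf{V}_t \ge \mathbf{0}$ vanishes only when the posterior over $\mathbf{x}_0$ is deterministic.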

Figures (16)

  • Figure 1: Toy visualization of sampling with continuous diffusion models in continuous and discrete data domains. Orange contours visualize the density of noisified data $p(\mathbf{x}_t)$. Blue points and arrows show in-distribution sampling trajectories; red ones show out-of-distribution trajectories. White diamonds show original data points in the discrete setting.
  • Figure 2: (a) correctness of predictions $\hat{\mathbf{x}}_\theta(\mathbf{x}_t)$ and (b) mean probability density $p(\mathbf{x}_{t})$ along correct, incorrect and optimal sampling trajectories; (c) share of pairs of trajectories lying within the same mode (separately for all learned and optimal trajectories). Gray vertical stripes show the transition interval.
  • Figure 3: left: ratio of candidate sets $\mathcal{S}$ with at least one correct sample vs. its size $|\mathcal{S}|$ for different activation timesteps $t_{\mathrm{act}}$; right: correctness and diversity of MBR with $|\mathcal{S}|=5$, self-conditioning (SC), and q-sampling vs. their activation timestep $t_{\mathrm{act}}$. Gray vertical stripes show the transition interval.
  • Figure 4: Comparison of one-step updates for DDPM and q-sampling. For each plotted timestep $t$, we run DDPM from $t'=T$ to $t'=t$. Then, we divide the trajectories into correct and incorrect based on $\hat{\mathbf{x}}_{\theta}(\mathbf{x}_{t})$. The next step $t'=t-1$ is obtained with either DDPM or q-sampling. The plots show (a) correctness of predictions $\hat{\mathbf{x}}_{\theta}(\mathbf{x}_{t-1})$, (b) mean distance $\|\mathbf{x}_{t} - \mathbf{x}_{t-1}\|$, (c) mean density $p(\mathbf{x}_{t-1})$, (d) mean magnitude of predictions $\|\hat{\mathbf{x}}_{\theta}(\mathbf{x}_{t-1})\|$. (A code sketch of this protocol follows the figure list.)
  • Figure 5: (a) mean distance between predictions $\|\hat{\mathbf{x}}_{\theta}(\mathbf{x}_{t}) - \hat{\mathbf{x}}_{\theta}(\mathbf{x}_{t-1})\|$ for correct and incorrect trajectories without SC and for the optimal model; (b) mean distance between predictions $\|\hat{\mathbf{x}}_{\theta}(\mathbf{x}_{t}) - \hat{\mathbf{x}}_{\theta}(\mathbf{x}_{t-1})\|$ with and without SC; (c) correctness along the trajectory with and without SC. Gray vertical stripes show the transition interval.
  • ...and 11 more figures
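
For concreteness, here is a sketch of the evaluation protocol described in the Figure 4 caption, reusing `ddpm_step` and `q_sampling_step` from the sketch above. `is_correct` and `density` stand in for the toy Random Hierarchy Model's correctness check and density evaluator (hypothetical names); the model is called without self-conditioning here for simplicity.

```python
import torch

@torch.no_grad()
def one_step_comparison(model, x_T, alpha_bar, t, is_correct, density):
    """Run DDPM from T down to t, then take one step to t-1 with each solver.

    `is_correct(x0_hat)` returns a per-sample boolean mask and `density(x)`
    estimates p(x); both are assumed helpers for the toy model.
    """
    T = len(alpha_bar) - 1
    x_t = x_T
    for s in range(T, t, -1):                  # DDPM trajectory from T to t
        x_t = ddpm_step(x_t, model(x_t, s), alpha_bar, s)
    x0_hat = model(x_t, t)
    mask = is_correct(x0_hat)                  # split: correct vs. incorrect
    stats = {}
    for name, x_prev in [
        ("ddpm", ddpm_step(x_t, x0_hat, alpha_bar, t)),
        ("q",    q_sampling_step(x0_hat, alpha_bar, t)),
    ]:
        x0_next = model(x_prev, t - 1)
        stats[name] = {
            "move":      (x_t - x_prev).flatten(1).norm(dim=1),  # ||x_t - x_{t-1}||
            "density":   density(x_prev),                        # p(x_{t-1})
            "pred_norm": x0_next.flatten(1).norm(dim=1),         # ||x_hat(x_{t-1})||
            "correct":   is_correct(x0_next),
        }
    return mask, stats  # per-group means over `mask` give the plotted curves
```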

Theorems & Definitions (4)

  • Theorem 1
  • Proof of Theorem 1
  • Remark 1
  • Remark 2