DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents
Yilun Xu, Gabriele Corso, Tommi Jaakkola, Arash Vahdat, Karsten Kreis
TL;DR
The paper addresses the challenge of learning multimodal data distributions with diffusion models by introducing Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff), which pair a small set of learnable discrete latents with the standard continuous Gaussian diffusion prior. Training proceeds in two stages: the denoiser and the encoder that infers the discrete latents are trained end-to-end, aided by Gumbel-Softmax relaxation and classifier-free guidance, and an autoregressive prior is then fit to the distribution of the discrete latents. Empirically, DisCo-Diff reduces the curvature of the generative ODE, improves score matching, and delivers state-of-the-art or competitive results on class-conditioned ImageNet-64/128 generation with ODE samplers and on molecular docking, indicating that the approach transfers across domains. Because it does not rely on pre-trained networks and needs only a few discrete latents with small codebooks, the conditioning mechanism improves fidelity at modest computational overhead and should apply to other data modalities and diffusion-based generative models.
Abstract
Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion process to encode data into a simple Gaussian distribution. However, encoding a complex, potentially multimodal data distribution into a single continuous Gaussian distribution arguably represents an unnecessarily challenging learning problem. We propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. We augment DMs with learnable discrete latents, inferred with an encoder, and train the DM and encoder end-to-end. DisCo-Diff does not rely on pre-trained networks, making the framework universally applicable. The discrete latents significantly simplify learning the DM's complex noise-to-data mapping by reducing the curvature of the DM's generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, a simple step because DisCo-Diff requires only a few discrete variables with small codebooks. We validate DisCo-Diff on toy data, several image synthesis tasks, and molecular docking, and find that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on the class-conditioned ImageNet-64 and ImageNet-128 datasets with an ODE sampler.
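
As a rough illustration of how discrete latents can be kept differentiable during end-to-end training, the following is a minimal NumPy sketch of Gumbel-Softmax sampling, the relaxation named in the TL;DR. It is not the authors' implementation: the function name `gumbel_softmax`, the latent count, the codebook size, and the random logits standing in for encoder output are all assumptions chosen for illustration.

```python
import numpy as np


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed one-hot sample per row of `logits` (last axis = codebook)."""
    # Gumbel(0, 1) noise by inverse transform sampling; clip to keep the logs finite.
    u = np.clip(rng.uniform(size=logits.shape), 1e-12, 1.0 - 1e-12)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled softmax over the perturbed logits.
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    exp_y = np.exp(y)
    return exp_y / exp_y.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
num_latents, codebook_size = 10, 100  # hypothetical sizes, for illustration only
logits = rng.normal(size=(num_latents, codebook_size))  # stand-in for encoder output

# Relaxed samples: each row is a probability vector over the codebook.
soft = gumbel_softmax(logits, temperature=1.0, rng=rng)
# Discretized samples: one-hot vectors at the argmax of each relaxed sample.
hard = np.eye(codebook_size)[soft.argmax(axis=-1)]
```

Each row of `soft` is a point on the probability simplex that gradients can flow through, while `hard` shows the corresponding discrete codes. Lower temperatures push the relaxed samples closer to one-hot vectors at the cost of higher-variance gradients.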
