Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models

Samuel Lavoie, Michael Noukhovitch, Aaron Courville

TL;DR

This work tackles the difficulty of achieving high-fidelity, productive diffusion-model generation on broad image distributions by conditioning on a learned, discrete image representation. It introduces Discrete Latent Code (DLC), a sequence $\mathbf{c} = (T_1, \dots, T_L)$ of discrete tokens derived from Simplicial Embeddings, enabling a factorization $p(\mathbf{x}) = \sum_{\mathbf{c}} p(\mathbf{x}|\mathbf{c})\, p(\mathbf{c})$ and training a discrete diffusion model to sample $p(\mathbf{c})$. DLC-conditioned diffusion achieves state-of-the-art unconditional ImageNet generation, supports compositional generation by mixing DLC tokens, and enables a text-to-DLC pipeline that maps text prompts to DLCs via a pretrained diffusion-language model, producing novel images outside the training distribution. The results advocate for discrete, compositional representations as a principled path to more capable diffusion models and suggest a scalable route to integrating language models for text-to-image generation through DLCs.

Abstract

We argue that diffusion models' success in modeling complex distributions stems largely from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that an ideal representation should improve sample fidelity, be easy to generate, and be compositional so that samples outside the training distribution can be generated. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate, and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator's training distribution.
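To make the notion of a discrete, composable code concrete, here is a minimal NumPy sketch. It follows the Figure 3 caption in taking an argmax over the vocabulary at each position, and in composing two codes by selecting tokens from either one. Everything else is an assumption for illustration: random logits stand in for the encoder outputs, the vocabulary size `V` is hypothetical, and the per-position random mask in `compose_dlcs` is only one possible selection rule, not necessarily the one used in the paper.

```python
import numpy as np


def logits_to_dlc(logits: np.ndarray) -> np.ndarray:
    """Turn per-position logits of shape (L, V) into L token ids via argmax."""
    if logits.ndim != 2:
        raise ValueError("expected logits of shape (L, V)")
    return logits.argmax(axis=-1)


def compose_dlcs(code_a, code_b, rng, p_from_a=0.5):
    """Build a new code by taking each position from code_a or code_b.

    Assumption: positions are chosen independently with probability p_from_a.
    """
    if code_a.shape != code_b.shape:
        raise ValueError("codes must have the same length")
    take_a = rng.random(code_a.shape) < p_from_a
    return np.where(take_a, code_a, code_b)


rng = np.random.default_rng(0)
L, V = 512, 64  # L = 512 as in DLC_512; V is a hypothetical vocabulary size

# Random logits stand in for the encoder outputs of two different images.
code_a = logits_to_dlc(rng.normal(size=(L, V)))
code_b = logits_to_dlc(rng.normal(size=(L, V)))

mixed = compose_dlcs(code_a, code_b, rng)
print(mixed.shape, mixed[:8])
```

Because the code is a sequence of integers, composition reduces to indexing. With continuous embeddings, the analogous operation would need some form of interpolation, which need not produce a valid conditioning input.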

Paper Structure

This paper contains 17 sections, 5 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Selected samples generated from a DiT-XL/2 with DLC$_{512}$ for both in-distribution and out-of-distribution (OOD) generation. The model is trained on ImageNet $256\times 256$, conditioned on a Discrete Latent Code of $512$ tokens. Left: samples from unconditional generation. Right: OOD samples from semantic compositional generation, conditioning on diverse compositions of two DLCs corresponding to (1) jellyfish and mushroom, (2) komondor and carbonara, and (3) tabby cat and golden retriever.
  • Figure 2: Unconditional diffusion gets worse at fitting a distribution as the number of modes increases. (a) Samples from the training data distribution $p_\text{data}$ with 121 mixture components. (b) Samples from an unconditional diffusion model trained on $p_\text{data}$. (c) Samples from a conditional diffusion model trained on $p_\text{data}$ and the ground-truth mixture index. (d) KL divergence between $p_\text{data}$ and the modeled distribution $p_\theta$ as the number of mixture components increases. The unconditional model's fit to the distribution degrades as the number of modes increases. Generation conditioned on an oracle index representing the mixture centroid, and generation conditioned on an index inferred from a Gaussian Mixture Model (GMM), both fit highly multimodal distributions well. (e, f) Heatmaps of the magnitude of the estimated score $s_\theta$ and vector fields with respect to the coordinate, for the unconditional and the conditional generative models respectively. For the conditional model in (f), we condition the score network on mixture index $c=6$.
  • Figure 3: Discrete Latent Codes (DLCs). Top Left: a DLC is the output of a finetuned DINOv2 with SEM, followed by an argmax over the vocabulary. Top Right: we can generate semantically compositional images from a composition of two DLCs by selecting tokens from either code. Bottom Left: we enable text-to-image generation by finetuning a text diffusion model for text-to-DLC sampling. Bottom Right: we sample unconditionally by first sampling a DLC with SEDD, then conditionally sampling an image with DiT.
  • Figure 4: DLC greatly improves training efficiency for FID without CFG on ImageNet. When FID without CFG is evaluated at intermediate training steps, DLC already improves on vanilla DiT performance at 1/4 of the steps. Baseline numbers are taken from [yu_representation_2025].
  • Figure 5: Scaling analysis of DLC: the trade-off between performance and compute is controlled via the sequence length. (a) FID with respect to compute: FID and compute scale with the sequence length of the DLC. (b) and (c) Training a generative model to generate long sequences, and training an image generative model conditioned on long sequences, both converge to a lower FID. (d) Longer sequences are more sensitive to the model size and attain lower FID. Results are obtained without CFG or remasking. Unless mentioned otherwise, DiTs are trained for 500 epochs.
  • ...and 7 more figures
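
The toy setting of Figure 2 and the factorization $p(\mathbf{x}) = \sum_{\mathbf{c}} p(\mathbf{x}|\mathbf{c})\, p(\mathbf{c})$ can be illustrated with a short NumPy sketch of two-stage (ancestral) sampling: first draw a discrete index $c$, then draw $\mathbf{x}$ conditioned on it. This is a hedged illustration and not the paper's experimental code. The $11 \times 11$ grid layout, the isotropic Gaussian components, the uniform prior, and the value of `sigma` are all assumptions. An exact categorical prior and exact Gaussian conditionals stand in for the learned code sampler and the conditional diffusion model.

```python
import numpy as np


def make_grid_centroids(side=11, spacing=1.0):
    """Centroids on a side x side grid centred at the origin (assumed layout)."""
    axis = (np.arange(side) - (side - 1) / 2) * spacing
    xx, yy = np.meshgrid(axis, axis)
    return np.stack([xx.ravel(), yy.ravel()], axis=1)  # shape (side*side, 2)


def sample_two_stage(n, centroids, prior, sigma, rng):
    """Ancestral sampling: c ~ p(c), then x ~ p(x | c) = N(centroid_c, sigma^2 I)."""
    c = rng.choice(len(centroids), size=n, p=prior)
    x = centroids[c] + sigma * rng.normal(size=(n, 2))
    return x, c


def marginal_density(x, centroids, prior, sigma):
    """Evaluate p(x) = sum_c p(x | c) p(c) for isotropic 2D Gaussian components."""
    sq_dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    cond = np.exp(-sq_dist / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return cond @ prior


rng = np.random.default_rng(0)
centroids = make_grid_centroids()            # 121 components, as in Figure 2
prior = np.full(len(centroids), 1 / len(centroids))
sigma = 0.1                                  # hypothetical component width

x, c = sample_two_stage(2000, centroids, prior, sigma, rng)
print(x.shape, marginal_density(x[:3], centroids, prior, sigma))
```

Once the index is given, each conditional is a single simple mode, so the multimodality is carried entirely by the discrete prior over $c$. That is the intuition behind conditioning on a discrete code rather than modeling $p(\mathbf{x})$ directly.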