Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
Samuel Lavoie, Michael Noukhovitch, Aaron Courville
TL;DR
This work tackles the difficulty of achieving high-fidelity, productive generation with diffusion models on broad image distributions by conditioning on a learned, discrete image representation. It introduces the Discrete Latent Code (DLC), a sequence $\bm c = (T_1, \dots, T_L)$ of discrete tokens derived from Simplicial Embeddings. This enables the factorization $p(\bm x) = \sum_{\bm c} p(\bm x \mid \bm c)\, p(\bm c)$, in which a discrete diffusion model is trained to sample from $p(\bm c)$ and an image diffusion model generates $\bm x$ conditioned on $\bm c$. DLC-conditioned diffusion achieves state-of-the-art unconditional ImageNet generation, supports compositional generation by mixing DLC tokens, and enables a text-to-DLC pipeline that maps text prompts to DLCs via a pretrained diffusion language model, producing novel images outside the training distribution. The results advocate for discrete, compositional representations as a principled path to more capable diffusion models and suggest a scalable route to integrating language models into text-to-image generation through DLCs.
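To make "mixing DLC tokens" concrete, the following is a minimal sketch of one way two codes could be composed: each position of the new code takes its token from one of the two source codes at random. The function name `compose_dlc`, the `mix_prob` parameter, and the toy token values are hypothetical and chosen for illustration; the mixing rule actually used in this work may differ.

```python
import random


def compose_dlc(c_a, c_b, mix_prob=0.5, seed=None):
    """Mix two equal-length token sequences position by position.

    Hypothetical helper: at each position, keep the token from c_a with
    probability mix_prob, otherwise take the token from c_b.
    """
    if len(c_a) != len(c_b):
        raise ValueError("both codes must have the same length L")
    rng = random.Random(seed)
    return [a if rng.random() < mix_prob else b for a, b in zip(c_a, c_b)]


# Toy codes of length L = 8 (values are made up, not real DLCs).
c_a = [3, 17, 42, 8, 0, 5, 29, 11]
c_b = [9, 17, 1, 30, 2, 5, 7, 64]

mixed = compose_dlc(c_a, c_b, seed=0)
print(mixed)
```

Under the factorization above, the composed code would then be passed to the DLC-conditioned image generator in place of a code sampled from $p(\bm c)$.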
Abstract
We argue that the success of diffusion models in modeling complex distributions comes, for the most part, from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that an ideal representation should improve sample fidelity, be easy to generate, and be compositional so that samples outside the training distribution can be generated. We introduce the Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate, and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state of the art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside the training distribution of the image generator.
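As an illustration of how a sequence of discrete tokens might be read off a Simplicial Embedding, the sketch below splits a feature vector into $L$ groups, applies a softmax within each group (so each group lies on a simplex), and takes the most probable index of each group as the token. The abstract only states that DLCs are "derived from" Simplicial Embeddings, so the argmax step is an assumption here, as are the function names and the toy feature values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simplicial_embedding(features, num_tokens, vocab_size):
    """Split features into num_tokens groups and softmax each group."""
    if len(features) != num_tokens * vocab_size:
        raise ValueError("features must have num_tokens * vocab_size entries")
    groups = [
        features[i * vocab_size:(i + 1) * vocab_size]
        for i in range(num_tokens)
    ]
    return [softmax(g) for g in groups]


def to_discrete_code(sem):
    """Assumed discretization: argmax index of each simplex."""
    return [max(range(len(p)), key=p.__getitem__) for p in sem]


# Toy encoder output with L = 3 groups of V = 4 entries (made-up values).
features = [0.1, 2.0, -1.0, 0.3,
            1.5, 0.2, 0.0, -0.7,
            -0.2, 0.0, 0.4, 3.1]

sem = simplicial_embedding(features, num_tokens=3, vocab_size=4)
code = to_discrete_code(sem)
print(code)  # [1, 0, 3]
```

In this toy setting the code has $L = 3$ tokens, each drawn from a vocabulary of $V = 4$ values, so there are $V^L = 64$ possible codes.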
