Glauber Generative Model: Discrete Diffusion Models via Binary Classification
Harshit Varma, Dheeraj Nagaraj, Karthikeyan Shanmugam
TL;DR
The paper tackles discrete generative modeling by introducing the Glauber Generative Model (GGM), which uses time-dependent Glauber dynamics to denoise discrete token sequences. The key idea is to reduce denoising to a sequence of binary classification tasks, which yields an exact reverse process learned by a transformer-based model and scales linearly in the vocabulary size. Empirically, GGM outperforms prior discrete diffusion models on language generation and is competitive on image generation without dataset-specific tokenizers, with robust zero-shot infilling. Although it does not yet surpass state-of-the-art autoregressive LLMs or GAN-based image methods, the framework is principled and scalable, and it shows potential for broader applications and extensions.
Abstract
We introduce the Glauber Generative Model (GGM), a new class of discrete diffusion models, to obtain new samples from a distribution given samples from a discrete space. GGM deploys a discrete Markov chain called the heat bath dynamics (or the Glauber dynamics) to denoise a sequence of noisy tokens to a sample from a joint distribution of discrete tokens. Our novel conceptual framework provides an exact reduction of the task of learning the denoising Markov chain to solving a class of binary classification tasks. More specifically, the model learns to classify a given token in a noisy sequence as signal or noise. In contrast, prior works on discrete diffusion models either solve regression problems to learn importance ratios, or minimize loss functions given by variational approximations. We apply GGM to language modeling and image generation, where images are discretized using image tokenizers like VQGANs. We show that it outperforms existing discrete diffusion models in language generation, and demonstrates strong performance for image generation without using dataset-specific image tokenizers. We also show that our model is capable of performing well in zero-shot control settings like text and image infilling.
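To make the heat bath (Glauber) dynamics mentioned above concrete, here is a minimal sketch of the textbook chain on a toy distribution: pick one position at random and resample its token from the conditional distribution given all other tokens. This is not the GGM training or sampling procedure. The toy target (a Potts-style distribution that rewards equal neighboring tokens) and the names `conditional_weights`, `glauber_step`, `run_chain`, and `beta` are all assumptions chosen for illustration.

```python
import math
import random


def conditional_weights(seq, i, vocab_size, beta):
    """Unnormalized conditional weights for the token at position i,
    given every other token, under a toy Potts-style target (assumed):
    weight(x) = exp(beta * number of equal adjacent pairs)."""
    weights = []
    for a in range(vocab_size):
        agree = 0
        if i > 0 and seq[i - 1] == a:
            agree += 1
        if i < len(seq) - 1 and seq[i + 1] == a:
            agree += 1
        weights.append(math.exp(beta * agree))
    return weights


def glauber_step(seq, vocab_size, beta, rng):
    """One heat bath update: choose a position uniformly at random and
    resample its token from the conditional given the rest (in place)."""
    i = rng.randrange(len(seq))
    weights = conditional_weights(seq, i, vocab_size, beta)
    seq[i] = rng.choices(range(vocab_size), weights=weights, k=1)[0]
    return seq


def run_chain(length, vocab_size, beta, steps, seed=0):
    """Start from uniformly random tokens and apply `steps` updates."""
    rng = random.Random(seed)
    seq = [rng.randrange(vocab_size) for _ in range(length)]
    for _ in range(steps):
        glauber_step(seq, vocab_size, beta, rng)
    return seq


if __name__ == "__main__":
    # Larger beta favors long runs of identical tokens.
    print(run_chain(length=12, vocab_size=4, beta=2.0, steps=2000))
```

In this toy the single-site conditionals are known in closed form. According to the abstract, GGM instead learns the denoising chain from data, by training a model to classify a given token in a noisy sequence as signal or noise.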
