Improved Sampling Schedules for Discrete Diffusion Models
Alberto Foresti, Mustapha Bounoua, Giulio Franzese, Luca Ambrogioni, Pietro Michiardi
TL;DR
This work extends thermodynamic and geometric perspectives to discrete diffusion, introducing entropy production as a principled measure of information generation during reverse diffusion and proving a Wasserstein-based speed limit for distributional transport. It derives a practical non-adiabatic entropy estimator using a neural score and proposes two intrinsically motivated sampling schedules, the Entropic Discrete Schedule (EDS) and the Wasserstein Discrete Schedule (WDS), which place timesteps uniformly in entropy or Wasserstein progress. The schedules require no additional training and improve generation quality across count data, music notation, vision, and language tasks at substantially lower compute budgets than baselines. Overall, the paper provides both a theoretical framework and a practical, modular approach to improving the efficiency of discrete diffusion models by aligning sampling with the model’s intrinsic information and transport dynamics.
Abstract
Discrete diffusion models have emerged as a powerful paradigm for generative modeling on sequence data; however, the information-theoretic principles governing their reverse processes remain significantly less understood than those of their continuous counterparts. In this work, we bridge this gap by analyzing the reverse process dynamics through the lens of thermodynamic entropy production. We propose the entropy production rate as a rigorous proxy for quantifying information generation, deriving as a byproduct a bound on the Wasserstein distance between intermediate states and the data distribution. Leveraging these insights, we introduce two novel sampling schedules that are uniformly spaced with respect to their corresponding physics-inspired metrics: the Entropic Discrete Schedule (EDS), which is defined by maintaining a constant rate of information gain, and the Wasserstein Discrete Schedule (WDS), which is defined by taking equal steps in terms of the Wasserstein distance. We empirically demonstrate that our proposed schedules significantly outperform state-of-the-art strategies across diverse application domains, including synthetic data, music notation, vision and language modeling, consistently achieving superior performance at a lower computational budget.
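To make the idea of a schedule that is "uniformly spaced" in a progress metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes a non-negative progress rate (for example, an estimated entropy production rate for an EDS-like schedule, or a Wasserstein speed proxy for a WDS-like schedule) has already been evaluated on a fine time grid. The function name `uniform_progress_schedule` and its arguments are hypothetical, and the paper's estimator itself is not reproduced here.

```python
import numpy as np

def uniform_progress_schedule(t_grid, rate, num_steps):
    """Pick num_steps + 1 timesteps with equal cumulative progress between them.

    t_grid    : increasing 1-D array of times on a fine grid (assumed input)
    rate      : non-negative progress rate at each grid time (assumed input,
                e.g. an estimated entropy production rate)
    num_steps : number of sampling steps in the resulting schedule
    """
    t_grid = np.asarray(t_grid, dtype=float)
    # A small floor keeps the cumulative curve strictly increasing so it can be inverted.
    rate = np.maximum(np.asarray(rate, dtype=float), 1e-12)

    # Cumulative progress via the trapezoidal rule.
    increments = 0.5 * (rate[1:] + rate[:-1]) * np.diff(t_grid)
    cumulative = np.concatenate([[0.0], np.cumsum(increments)])

    # Equally spaced progress levels, mapped back to time by linear interpolation.
    targets = np.linspace(0.0, cumulative[-1], num_steps + 1)
    return np.interp(targets, cumulative, t_grid)

# Illustrative usage with a synthetic rate that grows linearly in time.
t = np.linspace(0.0, 1.0, 101)
schedule = uniform_progress_schedule(t, 2.0 * t, num_steps=4)
```

With a constant rate the sketch reduces to a uniform time grid; where the rate is larger, timesteps are packed more densely. The returned times are in increasing order, so a sampler that runs the reverse process from noise to data would traverse them in reverse.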
