A Convergence Theory for Diffusion Language Models: An Information-Theoretic Perspective
Gen Li, Changxiao Cai
TL;DR
This work provides the first information-theoretic convergence guarantees for diffusion language models, linking the KL sampling error after T iterations to the mutual information among tokens and a training error from mask predictors. By decoupling training and sampling and analyzing a broad class of forward masking schedules, the authors prove an O(1/T) decay in KL distance with a tight, matching lower bound, demonstrating that the rate is information-theoretically optimal up to constants. The bounds explicitly quantify how token dependencies in language data control sampling difficulty and show that balanced masking schedules yield favorable convergence behavior. These results offer theoretical foundations for the practical effectiveness of diffusion-based, parallel-token sampling in language generation and guide design choices for mask-predictor training and schedule selection.
Abstract
Diffusion models have emerged as a powerful paradigm for modern generative modeling, demonstrating strong potential for large language models (LLMs). Unlike conventional autoregressive (AR) models that generate tokens sequentially, diffusion models enable parallel token sampling, leading to faster generation and eliminating left-to-right generation constraints. Despite their empirical success, the theoretical understanding of diffusion model approaches remains underdeveloped. In this work, we develop convergence guarantees for diffusion language models from an information-theoretic perspective. Our analysis demonstrates that the sampling error, measured by the Kullback-Leibler (KL) divergence, decays inversely with the number of iterations $T$ and scales linearly with the mutual information between tokens in the target text sequence. In particular, we establish matching upper and lower bounds, up to some constant factor, to demonstrate the tightness of our convergence analysis. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.
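To make the role of mutual information concrete, the following is a minimal sketch on a hypothetical two-token toy distribution, not the paper's construction or proof. It assumes an idealized sampler that unmasks both tokens in a single step by drawing each from its exact marginal, so the output follows the product of marginals. A standard identity then gives the KL sampling error of that one-step sampler as exactly the mutual information $I(X_1; X_2)$: the error vanishes for independent tokens and grows with their dependence. The function names and the example distributions are illustrative assumptions.

```python
import math

def entropy(pmf):
    """Shannon entropy in nats of a pmf given as a flat list."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

def kl(p, q):
    """KL(p || q) in nats for two pmfs over the same flat index set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def one_step_error(joint):
    """Return (mutual information, KL error of one-step parallel sampling).

    joint[a][b] is the target probability of the token pair (a, b).
    The one-step sampler is assumed to draw each token from its exact
    marginal, so its output law is the product of the two marginals.
    """
    p1 = [sum(row) for row in joint]        # marginal of token 1
    p2 = [sum(col) for col in zip(*joint)]  # marginal of token 2
    flat_joint = [p for row in joint for p in row]
    flat_product = [a * b for a in p1 for b in p2]  # same row-major order
    mi = entropy(p1) + entropy(p2) - entropy(flat_joint)
    return mi, kl(flat_joint, flat_product)

# Hypothetical targets: independent tokens vs. strongly correlated tokens.
independent = [[0.25, 0.25], [0.25, 0.25]]
correlated = [[0.45, 0.05], [0.05, 0.45]]

for name, joint in [("independent", independent), ("correlated", correlated)]:
    mi, err = one_step_error(joint)
    print(f"{name}: I(X1;X2) = {mi:.4f} nats, one-step KL error = {err:.4f} nats")
```

In this toy setting, unmasking the two tokens one at a time with exact conditionals would reproduce the target distribution exactly, which matches the qualitative picture in the abstract: more iterations reduce the error, and the size of the error is governed by how strongly the tokens depend on each other.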
