Masked Diffusion Models as Energy Minimization
Sitong Chen, Shen Nie, Jiacheng Sun, Zijin Feng, Zhenguo Li, Ji-Rong Wen, Chongxuan Li
TL;DR
This work reframes Masked Diffusion Models as energy-minimizing discrete transport processes, unifying kinetic, conditional kinetic, and geodesic viewpoints. It proves an optimal mask-schedule condition $\alpha_t^* = \sin^2\left(\frac{\pi}{2}\gamma_t\right)$ that minimizes all three energies and links discrete masking to geodesic interpolation on probability space. To make scheduling practical, it introduces a Beta-CDF parameterization of $\gamma_t$, reducing the search to two dimensions and enabling task-adaptive tuning after training, without retraining the model. Empirical results on synthetic and real-world benchmarks show that energy-inspired schedules can outperform hand-crafted baselines in few-step sampling. Schedule preferences are task-dependent, with code generation and reasoning tasks benefiting in particular.
Abstract
We present a systematic theoretical framework that interprets masked diffusion models (MDMs) as solutions to energy minimization problems in discrete optimal transport. Specifically, we prove that three distinct energy formulations (kinetic, conditional kinetic, and geodesic energy) are mathematically equivalent under the structure of MDMs, and that MDMs minimize all three when the mask schedule satisfies a closed-form optimality condition. This unification not only clarifies the theoretical foundations of MDMs, but also motivates practical improvements in sampling. By parameterizing interpolation schedules via Beta distributions, we reduce the schedule design space to a tractable 2D search, enabling efficient post-training tuning without model modification. Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
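To make the schedule construction concrete, the following is a minimal sketch of how a Beta-parameterized interpolation schedule could be turned into a mask schedule through the condition $\alpha_t^* = \sin^2\left(\frac{\pi}{2}\gamma_t\right)$ from the TL;DR. Several details are assumptions for illustration rather than the authors' implementation: $\gamma_t$ is taken to be the CDF of a Beta$(a, b)$ distribution evaluated at $t \in [0, 1]$, the time direction (so that $\gamma_0 = 0$ and $\gamma_1 = 1$) is assumed, the function names `beta_cdf` and `alpha_schedule` are hypothetical, and the CDF is approximated with a simple midpoint rule so that the example needs only the standard library.

```python
import math


def beta_cdf(x, a, b, n=2000):
    """Approximate the Beta(a, b) CDF at x with a midpoint rule.

    Assumption: a crude stdlib-only quadrature; accuracy degrades when
    a < 1 or b < 1 because the density is unbounded at an endpoint.
    """
    if not (a > 0 and b > 0):
        raise ValueError("a and b must be positive")
    x = min(max(x, 0.0), 1.0)  # clamp to the support [0, 1]
    h = 1.0 / n
    # Unnormalized density at cell midpoints (never exactly 0 or 1).
    dens = [
        ((i + 0.5) * h) ** (a - 1) * (1.0 - (i + 0.5) * h) ** (b - 1)
        for i in range(n)
    ]
    total = sum(dens)
    full = int(x * n)        # number of whole cells to the left of x
    frac = x * n - full      # fraction of the next cell covered by x
    partial = sum(dens[:full])
    if full < n:
        partial += dens[full] * frac
    # Normalizing by the same quadrature sum makes the result reach 1 at x = 1.
    return partial / total


def alpha_schedule(t, a, b):
    """Hypothetical helper: alpha_t = sin^2(pi/2 * gamma_t), gamma_t = BetaCDF(t; a, b)."""
    gamma_t = beta_cdf(t, a, b)
    return math.sin(0.5 * math.pi * gamma_t) ** 2


if __name__ == "__main__":
    # Evaluate one candidate (a, b) pair on a coarse time grid.
    for k in range(6):
        t = k / 5
        print(f"t={t:.1f}  alpha={alpha_schedule(t, 2.0, 5.0):.4f}")
```

Under these assumptions, each candidate schedule is identified by the pair $(a, b)$, which is what makes the design space two-dimensional; with $a = b = 1$ the interpolation reduces to $\gamma_t = t$. The criterion used to choose among candidate pairs is not specified here and is left out of the sketch.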
