Masked Diffusion Models as Energy Minimization

Sitong Chen, Shen Nie, Jiacheng Sun, Zijin Feng, Zhenguo Li, Ji-Rong Wen, Chongxuan Li

TL;DR

This work reframes Masked Diffusion Models as energy-minimizing discrete transport processes, unifying kinetic, conditional kinetic, and geodesic viewpoints. It proves an optimal mask-schedule condition $\alpha_t^* = \sin^2\left(\frac{\pi}{2}\gamma_t\right)$ that minimizes all three energies and links discrete masking to geodesic interpolation on probability space. To make scheduling practical, it introduces a Beta-CDF parameterization of $\gamma_t$, reducing the search to two dimensions and enabling post-training task-adaptive tuning without retraining. Empirical results on synthetic and real-world benchmarks show energy-inspired schedules can outperform hand-crafted baselines in few-step sampling, with task-dependent schedule preferences, particularly benefiting code generation and reasoning tasks.
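As a rough illustration of the schedule described above, the sketch below evaluates $\gamma_t$ as a Beta$(a, b)$ CDF and maps it through $\alpha_t^* = \sin^2\left(\frac{\pi}{2}\gamma_t\right)$. The function names `beta_cdf` and `alpha_star`, the choice of $t \in [0, 1]$, and the midpoint-rule integration are assumptions made for this example, not details taken from the paper.

```python
import math


def beta_cdf(t, a, b, n=2000):
    """Approximate the Beta(a, b) CDF at t with a midpoint rule.

    Midpoints never touch 0 or 1, so the integrand stays finite even
    when a < 1 or b < 1 (accuracy is lower in that regime).
    """
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    h = t / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)
    # Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return min(1.0, total * h / norm)


def alpha_star(t, a, b):
    """Schedule value sin^2(pi/2 * gamma_t), with gamma_t the Beta CDF (assumed)."""
    gamma_t = beta_cdf(t, a, b)
    return math.sin(0.5 * math.pi * gamma_t) ** 2


if __name__ == "__main__":
    # Hypothetical (a, b) pairs, chosen only to show different shapes.
    for a, b in [(1.0, 1.0), (2.0, 2.0), (2.0, 5.0)]:
        values = [round(alpha_star(k / 4, a, b), 4) for k in range(5)]
        print((a, b), values)
```

With $a = b = 1$ the Beta CDF is the identity, so the schedule reduces to $\sin^2\left(\frac{\pi}{2}t\right)$; other $(a, b)$ pairs bend the curve while keeping the endpoints at 0 and 1.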

Abstract

We present a systematic theoretical framework that interprets masked diffusion models (MDMs) as solutions to energy minimization problems in discrete optimal transport. Specifically, we prove that three distinct energy formulations--kinetic, conditional kinetic, and geodesic energy--are mathematically equivalent under the structure of MDMs, and that MDMs minimize all three when the mask schedule satisfies a closed-form optimality condition. This unification not only clarifies the theoretical foundations of MDMs, but also motivates practical improvements in sampling. By parameterizing interpolation schedules via Beta distributions, we reduce the schedule design space to a tractable 2D search, enabling efficient post-training tuning without model modification. Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
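The 2D search mentioned above could be organized as a plain grid search over the two Beta parameters. The sketch below is only an illustration: `grid_search_beta` and `toy_score` are hypothetical names, and the toy score is a stand-in for whatever task metric would be measured by sampling from the frozen model under each schedule.

```python
import itertools


def grid_search_beta(score_fn, a_values, b_values):
    """Return ((a, b), score) with the highest score on the Cartesian grid."""
    best_params, best_score = None, float("-inf")
    for a, b in itertools.product(a_values, b_values):
        score = score_fn(a, b)
        if score > best_score:
            best_params, best_score = (a, b), score
    return best_params, best_score


def toy_score(a, b):
    # Stand-in objective (an assumption for this example) that peaks at (2, 3).
    # In practice this would evaluate a benchmark with the schedule defined
    # by (a, b), keeping the trained model unchanged.
    return -((a - 2.0) ** 2 + (b - 3.0) ** 2)


grid = [0.5, 1.0, 2.0, 3.0, 4.0]
best_params, best_score = grid_search_beta(toy_score, grid, grid)
print(best_params, best_score)
```

Because only the schedule changes between evaluations, each grid point costs one round of sampling and scoring, with no retraining.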

Paper Structure

This paper contains 34 sections, 16 theorems, 76 equations, 7 figures, 9 tables.

Key Result

Theorem 3.1

For any weight function $\gamma_t$ and MDM with mask schedule $\alpha_t$, the marginal and conditional kinetic energies are proportional, with a proportionality constant $C_1$ that is a scalar depending only on the sequence length $n$ and vocabulary size $d$. As a result, the two objectives share the same minimizers.
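The step from proportionality to shared minimizers can be written out as follows. The symbols $\mathcal{E}_{\mathrm{kin}}$ and $\mathcal{E}_{\mathrm{cond}}$ are placeholders rather than the paper's notation, and the sketch assumes $C_1 > 0$, which is not stated explicitly here; the same argument applies whichever side the constant sits on.

```latex
\mathcal{E}_{\mathrm{kin}}(\alpha_t) = C_1 \, \mathcal{E}_{\mathrm{cond}}(\alpha_t),
\quad C_1 > 0
\;\Longrightarrow\;
\operatorname*{arg\,min}_{\alpha_t} \mathcal{E}_{\mathrm{kin}}(\alpha_t)
= \operatorname*{arg\,min}_{\alpha_t} \mathcal{E}_{\mathrm{cond}}(\alpha_t)
```

Scaling an objective by a positive constant that does not depend on $\alpha_t$ preserves the ordering of its values, so the set of minimizing schedules is unchanged.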

Figures (7)

  • Figure 1: Illustration of the theoretical results of this paper.
  • Figure 2: Distinct weight functions $\gamma_t$ shape different energy landscapes and consequently yield different optimal mask schedules $\alpha_t^\star$. Axes represent the Beta parameterization of $\alpha_t$ (see Sec. \ref{sub:Energy-Inspired}). Color intensity indicates energy values from Eq. (\ref{eq:energy_n=1}). Red stars mark the theoretical minima under the optimal schedule condition.
  • Figure 3: Beta-parameterized interpolation schedules and corresponding mask schedules. The left panel shows the shapes of Beta-parameterized interpolation schedules, while the right panel displays the corresponding optimal $\alpha_t^\star$ schedules derived via Condition \ref{cond:optimal_schedule}.
  • Figure 4: Toy experiments illustrating how different target distributions prefer different schedules. Each panel visualizes the effect of Beta parameter tuning on sampling quality under limited step budgets by showing a target distribution and two distributions sampled by different schedules. More details of this experiment are provided in Appendix \ref{app:toyexp}.
  • Figure 5: Performance evaluation of energy-optimized schedules on LLaDA 8B. Each panel corresponds to a distinct benchmark. The x-axis shows sampling steps on a logarithmic scale, and the y-axis shows task performance, where higher values indicate better generation quality. Results on benchmarks where Beta-parameterized schedules perform comparably but not better are provided in Appendix \ref{app:raw}.
  • ...and 2 more figures

Theorems & Definitions (29)

  • Definition 2.1: Weighted kinetic energy
  • Definition 2.2: Weighted conditional kinetic energy
  • Definition 2.3: Weighted geodesic energy
  • Theorem 3.1: Kinetic-conditional equivalence in MDMs
  • Theorem 3.2: Conditional-geodesic equivalence in MDMs
  • Example 3.3
  • Lemma 3.5: Geodesic energy minimization
  • Theorem 3.6: Kinetic energy minimization
  • Proposition 3.7
  • Lemma C.1
  • ...and 19 more