Masked Diffusion Models are Secretly Learned-Order Autoregressive Models

Prateek Garg, Bhavya Kohli, Sunita Sarawagi

TL;DR

This work extends Masked Diffusion Models (MDMs) to learn favorable decoding orders in discrete data by employing multivariate noise schedules. The authors prove that the continuous-time ELBO of MDMs decomposes into a weighted auto-regressive loss over possible orders, with the schedule defining the order distribution and enabling state-independent learning of decoding sequences. They establish an exact correspondence between token ordering and inference-time schedules, and validate the theory on tabular data where learned schedules modestly improve validation loss while maintaining competitive data-fidelity metrics. Overall, the approach highlights how diffusion-based methods can implicitly discover and optimize token ordering, offering a principled path toward learnable, order-aware generative models for discrete domains with potential implications for speed-accuracy trade-offs and structure discovery in complex data.

Abstract

Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between decoding order and the multivariate noise schedule and show that this setting breaks the invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into weighted auto-regressive losses over these orders, which establishes MDMs as auto-regressive models with learnable orders.
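The link between a multivariate noise schedule and a distribution over orders can be illustrated with a small simulation. The sketch below is not the paper's parameterization: it assumes a hypothetical power-law schedule family $\alpha_i(t) = 1 - t^{w_i}$, where $\alpha_i(t)$ is the probability that token $i$ is still unmasked at time $t$, and the names `sample_masking_order` and `order_distribution` are illustrative. Under this assumption each token's masking time can be sampled by inverse CDF, and sorting the times gives the forward masking order (the decoding order is its reverse).

```python
import itertools
import random
from collections import Counter


def sample_masking_order(weights, rng):
    """Sample one forward masking order under the ASSUMED schedule
    alpha_i(t) = 1 - t**w_i (not taken from the paper).

    The masking time T_i has CDF 1 - alpha_i(t) = t**w_i, so
    T_i = U**(1 / w_i) with U ~ Uniform(0, 1) by inverse-CDF sampling.
    """
    times = [rng.random() ** (1.0 / w) for w in weights]
    # Tokens with smaller masking times are masked earlier.
    return tuple(sorted(range(len(weights)), key=lambda i: times[i]))


def order_distribution(weights, n_samples=20000, seed=0):
    """Monte Carlo estimate of the probability of each masking order."""
    rng = random.Random(seed)
    counts = Counter(
        sample_masking_order(weights, rng) for _ in range(n_samples)
    )
    perms = itertools.permutations(range(len(weights)))
    return {p: counts[p] / n_samples for p in perms}


# Identical per-token schedules (the univariate case): all orders equally likely.
uniform = order_distribution([1.0, 1.0, 1.0])

# Different per-token schedules: some orders become much more likely.
skewed = order_distribution([0.25, 1.0, 4.0])

# Two tokens: under this schedule family, P(token 0 masked first) = w_1 / (w_0 + w_1).
pair = order_distribution([1.0, 3.0])

if __name__ == "__main__":
    for name, dist in [("uniform", uniform), ("skewed", skewed), ("pair", pair)]:
        print(name, {p: round(q, 3) for p, q in dist.items()})
```

With equal weights the six orders each receive roughly one sixth of the mass, which matches the uniformly random order of a univariate schedule. With unequal weights the mass concentrates on orders that mask low-weight tokens first, so making the weights learnable amounts to learning a distribution over orders in this toy setting.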

Paper Structure

This paper contains 14 sections, 5 theorems, 21 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Proposition 3.1

The diffusion loss (eqn:elbo-trajectory) can be decomposed over the orders as follows: where ${\bm \mu}({\mathbf x}_{\pi(i)}|{\mathbf x}_{\pi(<i)},t^*_{\pi(i)};\theta) = {\bm \mu}^{{\mathbf x}^{\pi(i)}_0}_{\pi(i)}({\mathbf x}_{\pi(<i)}, t^*_{\pi(i)} ; \theta)$, ${\mathbf x}_{\pi(<i)}$ is a sequence obtained by masking the $\pi(\geq i)$ indices of ${\mathbf x}_0$ and $t^*_{\pi(i)}$ is

Figures (5)

  • Figure 1: The forward process of masked diffusion masks variables in an order. For univariate noise schedules this order is uniformly random, whereas a multivariate noise schedule makes some orders more likely than others.
  • Figure 2: Comparing the best validation losses for ${\text{ MDM(LS)}}$ and ${\text{ MDM}}$ and visualizing the noise schedules learned by ${\text{ MDM(LS)}}$ on the Adult dataset.
  • Figure 3: Schedules learned by TabDiff, reproduced from the original implementation.
  • Figure 4: TabDiff fails to learn schedules even with random initialization.
  • Figure 5: Schedules learned by our implementation of ${\text{ MDM(LS)}}$. Note that our implementation uses masked diffusion for all feature columns and not just for categorical features.

Theorems & Definitions (5)

  • Proposition 3.1
  • Proposition 3.2
  • Corollary 3.1
  • Proposition B.1
  • Proposition B.1