Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture

Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma

TL;DR

The paper addresses the unfair comparison between autoregressive (AR) models and masked diffusion models (MDMs) by decoupling the modeling paradigm from the architecture. It introduces AO-GPT, a decoder-only any-order AR (AO-AR) model. It reports that AO-AR can match AR in density estimation and generation under certain conditions, that the vast permutation space makes the AO-AR objective harder to learn, and that decoder-only models offer substantial generation speedups over encoder-only MDMs. In decoder-only setups, AO-AR converges more slowly than left-to-right (L2R) AR, but a small amount of L2R data and targeted position-informed training significantly narrow the performance gap. The work also shows that encoder-only and decoder-only AO-AR models face conditional probability spaces of very different sizes, and it proposes architectural enhancements (AdaLN, EMA) and parallel generation masks to improve efficiency and stability, highlighting AO-GPT’s potential to unify AR and MDM approaches across domains.

Abstract

Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultaneously makes direct comparisons unfair, as it's hard to distinguish whether observed differences stem from the paradigm itself or the architectural shift. This research evaluates MDMs within a decoder-only framework to: (1) equitably compare MDM (as Any-Order AR, or AO-AR) and standard AR paradigms. Our investigation suggests that the standard AO-AR objective, which averages over all token permutations, may benefit from refinement, as many permutations appear less informative compared to the language's inherent left-to-right structure. (2) Investigate architectural influences (decoder-only vs. encoder-only) within MDMs. We demonstrate that while encoder-only MDMs model a simpler conditional probability space, decoder-only MDMs can achieve dramatic generation speedups ($\sim25\times$) and comparable perplexity with temperature annealing despite modeling a vastly larger space, highlighting key trade-offs. This work thus decouples core paradigm differences from architectural influences, offering insights for future model design. Code is available at https://github.com/scxue/AO-GPT-MDM.
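
As a reading aid, the AO-AR objective that the abstract describes as averaging over all token permutations is commonly written as below. This is the standard any-order autoregressive formulation rather than an equation copied from the paper; the symbols $n$ (sequence length), $S_n$ (the set of permutations of $\{1,\dots,n\}$), $\sigma$ (a sampled order), and $p_\theta$ (the model) are notation assumed here.

$$
\mathcal{L}_{\text{AO-AR}}(\theta) \;=\; \mathbb{E}_{\boldsymbol{x}\sim p_{\text{data}}}\; \mathbb{E}_{\sigma \sim \mathrm{Unif}(S_n)} \left[ -\sum_{i=1}^{n} \log p_\theta\!\left(x_{\sigma(i)} \,\middle|\, x_{\sigma(1)}, \dots, x_{\sigma(i-1)}\right) \right]
$$

Standard left-to-right AR training corresponds to fixing $\sigma$ to the identity permutation, so it optimizes a single term of the $n!$-term average above.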

Paper Structure

This paper contains 27 sections, 1 theorem, 10 equations, 12 figures, 5 tables.

Key Result

Lemma 1

For sampling $\boldsymbol{x}_s^i$ from $q_{s|t}(\boldsymbol{x}_s^i|\boldsymbol{x}_t)$ as defined in Equation eq:q_st_definition_main when $\boldsymbol{x}_t^i = [\text{MASK}]$, an equivalent sampling procedure is: 1. Sample a binary variable $b \sim \text{Bernoulli}\left(\frac{s}{t}\right)$. 2. If $b
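
The two-stage procedure in Lemma 1 can be sketched in a few lines of Python. The lemma text is cut off after the first step here, so the second step in this sketch is an assumption based on the usual masked-diffusion reverse step: $b=1$ keeps the position masked, and $b=0$ reveals it by sampling from the model's predicted distribution. The names `MASK`, `reverse_step`, and `predict_probs` are hypothetical and not taken from the paper's code.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id standing in for the [MASK] token


def reverse_step(x_t, s, t, predict_probs, rng):
    """One reverse step from time t to an earlier time s (0 <= s < t <= 1).

    x_t           : 1-D integer array; masked positions hold MASK.
    predict_probs : hypothetical callable (x_t, positions) -> array of shape
                    [len(positions), vocab_size] holding per-position
                    categorical probabilities from the model.
    rng           : numpy.random.Generator.
    """
    assert 0.0 <= s < t <= 1.0
    x_s = x_t.copy()
    masked = np.flatnonzero(x_t == MASK)

    # Step 1: draw b ~ Bernoulli(s / t) independently for each masked position.
    b = rng.random(masked.size) < (s / t)

    # Step 2 (assumed): b = 1 keeps the position masked; b = 0 reveals it.
    reveal = masked[~b]
    if reveal.size:
        # The model is queried only for the positions that get revealed.
        probs = predict_probs(x_t, reveal)
        for pos, p in zip(reveal, probs):
            x_s[pos] = rng.choice(p.size, p=p)
    return x_s
```

Under this reading, the saving comes from drawing the Bernoulli variables first: positions that stay masked never need a categorical sample, and when a step reveals nothing the model call is skipped entirely.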

Figures (12)

  • Figure 1: Training loss curves comparing a standard AR GPT against an AO-GPT. Both models employ an identical decoder-only architecture, with AO-GPT demonstrating slower initial convergence.
  • Figure 2: (a) Convergence speed with different fixed prediction orders: left-to-right, fixed random, and fixed block-wise random. (b) Impact of adding 10% left-to-right (L2R) data to AO-GPT training on its L2R and any-order loss.
  • Figure 3: Zero-shot unconditional perplexity (↓) for varying ensemble sizes. An ensemble size of 1 represents the baseline model without ensembling.
  • Figure 4: Generation time versus number of generation steps at sequence length $1024$ and batch size $32$ for decoder-only AO-AR models (with KV-cache and Lemma 1) and their encoder-only counterparts (SEDD).
  • Figure 5: Target position injection strategies for decoder-only AO-AR model.
  • ...and 7 more figures

Theorems & Definitions (3)

  • Remark 1: Identical Loss Lower Bound
  • Lemma 1: Efficient Sampling Algorithm
  • proof