On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond

Chenxiao Yang, Cai Zhou, David Wipf, Zhiyuan Li

TL;DR

This work formalizes and contrasts three generation paradigms: autoregressive models (ARM), masked diffusion models (MDM), and a proposed any-process masked diffusion model (AP-MDM). It proves that MDM with sufficiently large context is computationally universal, with decoding steps matching optimal parallel (PRAM) time, but that, when other factors are controlled for, any-order generation does not expand what ARM can already solve. It then introduces AP-MDM, whose edit operations (unmask, remask, insert, delete) enable self-correction and length-variable editing. The authors show that AP-MDM can solve substantially harder problems (e.g., NP-hard and certain PSPACE tasks) with test-time scaling, demonstrate examples in Sudoku, coding, biology, and graph editing, and prove fundamental limitations on simulating AP-MDM with constant-depth ARM. The results suggest a design direction for future LLMs: adopt any-process generation to handle non-sequential, structure-rich domains and to improve learning efficiency and OOD generalization on tasks such as coding and scientific reasoning.

Abstract

Diffusion language models have recently emerged as a competitive alternative to autoregressive language models. Beyond next-token generation, they are more efficient and flexible by enabling parallel and any-order token generation. However, despite empirical successes, their computational power and fundamental limitations remain poorly understood. In this paper, we formally study whether non-autoregressive generation in Masked Diffusion Models (MDM) enables solving problems beyond the reach of Auto-Regressive Models (ARM). Our results show that MDM with sufficiently large context length is computationally universal with decoding steps matching the optimal parallel time complexity in PRAM. However, when controlling for other factors, MDM's flexibility to generate in any-order does not expand what ARM can already solve. To address this, we propose a new form of generation called any-process generation, which extends MDM with capabilities to remask, insert and delete tokens, allowing self-correction, length-variable editing, and adaptive parallelism. Theoretically and empirically, we demonstrate these capabilities enable scalability to significantly harder reasoning problems that are otherwise intractable for ARM and vanilla MDM. Additionally, they prove essential for generation tasks where objects naturally evolve through non-sequential processes, crucial for extending current LLMs beyond natural language to domains such as coding and science.
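The edit operations named in the abstract can be illustrated with a toy sketch on a plain token list. This is only an illustration under stated assumptions, not the paper's formal definition: the `MASK` symbol, the function names, and the choice that `insert` adds a fresh masked slot are all hypothetical.

```python
from typing import List

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def unmask(seq: List[str], pos: int, token: str) -> List[str]:
    """Fill the masked slot at `pos` with `token`."""
    if seq[pos] != MASK:
        raise ValueError("can only unmask a masked position")
    return seq[:pos] + [token] + seq[pos + 1:]


def remask(seq: List[str], pos: int) -> List[str]:
    """Turn the token at `pos` back into a mask (enables self-correction)."""
    return seq[:pos] + [MASK] + seq[pos + 1:]


def insert(seq: List[str], pos: int) -> List[str]:
    """Insert a new masked slot before `pos` (assumed behaviour; grows the sequence)."""
    return seq[:pos] + [MASK] + seq[pos:]


def delete(seq: List[str], pos: int) -> List[str]:
    """Remove the slot at `pos` (shrinks the sequence)."""
    return seq[:pos] + seq[pos + 1:]


def demo() -> List[str]:
    seq = ["2", "+", "2", "=", MASK]
    seq = unmask(seq, 4, "5")      # a wrong guess
    seq = remask(seq, 4)           # undo it instead of being stuck with it
    seq = unmask(seq, 4, "4")      # corrected answer
    seq = insert(seq, len(seq))    # open an extra slot at the end
    seq = delete(seq, len(seq) - 1)  # decide it is not needed and drop it
    return seq


if __name__ == "__main__":
    print(demo())
```

In a plain MDM only `unmask` is available, so the wrong guess in `demo` could not be revised and the sequence length would be fixed from the start.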

Paper Structure

This paper contains 97 sections, 13 theorems, 83 equations, 12 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

For any PRAM program that runs on input $\mathbf{x} \in \Sigma^n$ in at most $T(n)$ parallel time with $P(n)$ maximum processors, there exists an MDM on input $\mathbf{x}$, padded to $S(n) = \mathcal{O}(P(n) \cdot T(n))$, that matches the PRAM output in $\mathcal{O}(T(n))$ decoding steps, i.e. …
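To get a feel for the quantities in Theorem 1, the following sketch simulates a PRAM-style tree reduction (pairwise summation) and reports the number of parallel rounds $T$ and the peak number of active processors $P$. It is an illustrative example only: the function name `tree_sum_rounds` is hypothetical, the reduction is not taken from the paper, and the constants hidden by the $\mathcal{O}(\cdot)$ notation are not modeled.

```python
from typing import Iterable, Tuple


def tree_sum_rounds(values: Iterable[int]) -> Tuple[int, int, int]:
    """Simulate a pairwise tree reduction on a non-empty input.

    Returns (total, rounds, max_processors), where `rounds` plays the role
    of parallel time T and `max_processors` the role of P.
    """
    level = list(values)
    rounds = 0
    max_procs = 0
    while len(level) > 1:
        pairs = len(level) // 2           # one processor per pair this round
        max_procs = max(max_procs, pairs)
        nxt = [level[2 * i] + level[2 * i + 1] for i in range(pairs)]
        if len(level) % 2:                # an unpaired element carries over
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds, max_procs


if __name__ == "__main__":
    total, T, P = tree_sum_rounds(range(16))
    # P * T is the quantity that bounds the padded length S(n), up to constants.
    print(f"sum={total}, T={T}, P={P}, P*T={P * T}")
```

For $n = 16$ inputs this gives $T = 4$ rounds and $P = 8$ processors, so the theorem's bound would correspond to a padded length on the order of $P \cdot T = 32$ and on the order of $4$ decoding steps, up to constant factors.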

Figures (12)

  • Figure 1: Comparison between autoregressive generation, any-order generation (standard MDM) and any-process generation (our MDM).
  • Figure 2: Examples of any-process generation for different tasks.
  • Figure 3: Experimental results on Sudoku puzzles. Results of ARM and AO-MDM are taken from Kim et al. (2025). Losses are defined in the training appendix.
  • Figure 4: Graph generation and parity task results.
  • Figure 5: Demonstration of value assignment.
  • ...and 7 more figures

Theorems & Definitions (34)

  • Definition 1: MDM
  • Definition 2: PRAM
  • Theorem 1: MDM Simulation of PRAM, Informal
  • Theorem 2
  • Definition 3: Masked-ARM
  • Theorem 3: Left-to-Right vs. Any-Order, Informal
  • Theorem 4: AP-MDM Simulation of PRAM, Informal
  • Theorem 5: Generating Two-Sided Dyck-$k$, Informal
  • Theorem 6: Hardness of Simulating AP-MDM, Informal
  • Definition 4: Position-Indexed Seq-to-Embedding Function
  • ...and 24 more