On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond
Chenxiao Yang, Cai Zhou, David Wipf, Zhiyuan Li
TL;DR
This work formalizes and contrasts three generation paradigms: autoregressive models (ARM), masked diffusion models (MDM), and a proposed any-process MDM (AP-MDM). It proves that MDM with sufficiently large context length is computationally universal, with a number of parallel decoding steps matching optimal parallel time complexity, but that, once other factors are controlled for, any-order generation does not let MDM solve problems that ARM cannot. It then introduces AP-MDM, whose edit operations (unmask, remask, insert, delete) enable self-correction and structural editing. The authors show that AP-MDM can solve substantially harder problems (e.g., NP-hard and certain PSPACE tasks) with test-time scaling, demonstrate it on Sudoku, coding, biology, and graph editing, and prove fundamental limitations on simulating AP-MDM with constant-depth ARM. The results suggest a concrete design direction for future LLMs: adopt any-process generation to handle non-sequential, structure-rich domains and to improve learning efficiency and OOD generalization on tasks such as coding and scientific reasoning.
Abstract
Diffusion language models have recently emerged as a competitive alternative to autoregressive language models. Going beyond next-token generation, they offer greater efficiency and flexibility by enabling parallel and any-order token generation. However, despite empirical successes, their computational power and fundamental limitations remain poorly understood. In this paper, we formally study whether non-autoregressive generation in Masked Diffusion Models (MDM) enables solving problems beyond the reach of Auto-Regressive Models (ARM). Our results show that MDM with sufficiently large context length is computationally universal, with the number of decoding steps matching the optimal parallel time complexity in the PRAM model. However, when controlling for other factors, MDM's flexibility to generate in any order does not expand what ARM can already solve. To address this, we propose a new form of generation called any-process generation, which extends MDM with the capabilities to remask, insert, and delete tokens, allowing self-correction, length-variable editing, and adaptive parallelism. Theoretically and empirically, we demonstrate that these capabilities enable scaling to significantly harder reasoning problems that are otherwise intractable for ARM and vanilla MDM. Additionally, they prove essential for generation tasks where objects naturally evolve through non-sequential processes, which is crucial for extending current LLMs beyond natural language to domains such as coding and science.
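To make the four edit operations concrete, the following is a minimal illustrative sketch of a single any-process step applied to a token sequence. It is not the authors' implementation. The names `MASK`, `Edit`, and `apply_edits` are hypothetical, and the exact semantics are assumptions: here "insert" places a fresh mask after a position and "delete" drops the token at a position, with all edits indexed against the input sequence and applied in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Literal, Optional

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


@dataclass(frozen=True)
class Edit:
    op: Literal["unmask", "remask", "insert", "delete"]
    pos: int                     # index into the sequence before this step
    token: Optional[str] = None  # only used by "unmask"


def apply_edits(seq: List[str], edits: List[Edit]) -> List[str]:
    """Apply one parallel step of edits (assumed semantics, at most one edit per position)."""
    by_pos: Dict[int, Edit] = {}
    for e in edits:
        if not 0 <= e.pos < len(seq):
            raise ValueError(f"position {e.pos} out of range")
        if e.pos in by_pos:
            raise ValueError(f"more than one edit at position {e.pos}")
        by_pos[e.pos] = e

    out: List[str] = []
    for i, tok in enumerate(seq):
        e = by_pos.get(i)
        if e is None:
            out.append(tok)               # untouched position
        elif e.op == "unmask":
            if tok != MASK or e.token is None:
                raise ValueError("unmask needs a masked position and a token")
            out.append(e.token)           # commit a token, as in a vanilla MDM
        elif e.op == "remask":
            out.append(MASK)              # retract a decision (self-correction)
        elif e.op == "insert":
            out.extend([tok, MASK])       # grow the sequence with a fresh mask
        elif e.op == "delete":
            continue                      # shrink the sequence
        else:
            raise ValueError(f"unknown op {e.op!r}")
    return out


step = [Edit("insert", 0), Edit("unmask", 1, "2"), Edit("remask", 2), Edit("delete", 3)]
result = apply_edits(["1", MASK, "3", "x"], step)
```

Under these assumptions, a vanilla MDM corresponds to steps that use only "unmask", so the sequence length is fixed and committed tokens are never revised. The other three operations are what allow the length to change and earlier decisions to be undone.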
