Unifying Autoregressive and Diffusion-Based Sequence Generation
Nima Fathi, Torsten Scholak, Pierre-André Noël
TL;DR
The paper bridges the divide between autoregressive (AR) and diffusion-based sequence generation by introducing hyperschedules, which assign position-dependent noise schedules to token positions and thereby place AR and diffusion models on a single continuum. It introduces two hybrid forward processes, $\gamma$-Hybrid and $\epsilon$-Hybrid, along with an Adaptive Correction Sampler that lets models revise earlier decisions, and it uses attention masking compatible with KV-caching for efficiency. Empirically, the approach yields state-of-the-art perplexity among discrete diffusion models on OpenWebText and LM1B, improves zero-shot generalization across multiple datasets, and demonstrates favorable quality-diversity trade-offs in generated sequences. The results suggest a promising path toward hybrid autoregressive-diffusion sequence generation with practical benefits in training efficiency, inference speed, and controllable generation.
Abstract
We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models. First, we introduce hyperschedules, which assign distinct noise schedules to individual token positions, generalizing both autoregressive models (e.g., GPT) and conventional diffusion models (e.g., SEDD, MDLM) as special cases. Second, we propose two hybrid token-wise noising processes that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes, and we introduce a novel inference algorithm that leverages this new feature in a simplified context inspired by MDLM. To support efficient training and inference, we design attention masks compatible with KV-caching. Our methods achieve state-of-the-art perplexity and generate diverse, high-quality sequences across standard benchmarks, suggesting a promising path for autoregressive diffusion-based sequence generation. See code and resources at https://hdlm-colm.github.io/
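
To make the hyperschedule idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It represents a hyperschedule as a table whose entry at (step, position) is a noise level in [0, 1]. The convention that 1 means fully noised and 0 means clean, the linear ramp, and all function names are assumptions made here for illustration. Under these assumptions, a conventional diffusion model corresponds to a table whose rows are constant across positions, and an autoregressive model corresponds to a step-function table that cleans one position per step.

```python
def clip01(x):
    """Clamp a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def diffusion_hyperschedule(seq_len, num_steps):
    """Flat schedule: at each step, every position shares one noise level.

    Assumes a linear ramp from 1.0 (fully noised) to 0.0 (clean).
    """
    return [[1.0 - t / num_steps] * seq_len for t in range(num_steps + 1)]


def autoregressive_hyperschedule(seq_len):
    """Step schedule: at step t, positions before t are clean, the rest noised."""
    return [
        [0.0 if i < t else 1.0 for i in range(seq_len)]
        for t in range(seq_len + 1)
    ]


def windowed_hyperschedule(seq_len, width):
    """Hypothetical in-between schedule: a denoising window of `width`
    positions slides left to right, so several tokens are partially
    denoised at once. With width == 1 it reduces to the step schedule.
    """
    num_steps = seq_len + width - 1
    return [
        [clip01(1.0 - (t - i) / width) for i in range(seq_len)]
        for t in range(num_steps + 1)
    ]


if __name__ == "__main__":
    for row in windowed_hyperschedule(seq_len=4, width=2):
        print(row)
```

In this sketch, `windowed_hyperschedule(n, 1)` equals `autoregressive_hyperschedule(n)`, which illustrates how a single position-dependent schedule family can contain the autoregressive case; the windowed family is only one assumed way to interpolate and is not claimed to match the schedules used in the paper.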
