Unifying Autoregressive and Diffusion-Based Sequence Generation

Nima Fathi, Torsten Scholak, Pierre-André Noël

TL;DR

The paper addresses the division between autoregressive (AR) and diffusion-based sequence generation by presenting hyperschedules, which assign position-dependent noise levels to token positions and thereby place AR and diffusion models on a single continuum. It introduces two hybrid forward processes, $\gamma$-Hybrid and $\epsilon$-Hybrid, along with an Adaptive Correction Sampler that lets models revise earlier decisions, and it uses attention masking compatible with KV-caching for efficiency. Empirically, the approach yields state-of-the-art perplexity among discrete diffusion models on OpenWebText and LM1B, improves zero-shot generalization across multiple datasets, and demonstrates favorable quality-diversity trade-offs in generated sequences. The results suggest a promising path toward autoregressive diffusion-based sequence generation with practical benefits in training efficiency, inference speed, and controllable generation.

Abstract

We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models. First, we introduce hyperschedules, which assign distinct noise schedules to individual token positions, generalizing both autoregressive models (e.g., GPT) and conventional diffusion models (e.g., SEDD, MDLM) as special cases. Second, we propose two hybrid token-wise noising processes that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes, and we introduce a novel inference algorithm that leverages this new feature in a simplified context inspired by MDLM. To support efficient training and inference, we design attention masks compatible with KV-caching. Our methods achieve state-of-the-art perplexity and generate diverse, high-quality sequences across standard benchmarks, suggesting a promising path for autoregressive diffusion-based sequence generation. See code and resources at https://hdlm-colm.github.io/
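
To make the idea of interpolating between absorbing and uniform noising concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's $\gamma$-Hybrid or $\epsilon$-Hybrid definition: the function name `corrupt`, the mixing parameter `eps`, the rule that swaps happen with probability `eps * tau`, and the convention that the mask id equals `vocab_size` are all hypothetical choices made for this example.

```python
import numpy as np


def corrupt(x, tau, vocab_size, mode, eps=0.1, rng=None):
    """Noise a token sequence x, given per-position noise levels tau in [0, 1].

    Illustrative only. Assumptions (not from the paper): the MASK id is
    vocab_size, and the "hybrid" mode swaps a token for a random one with
    probability eps * tau before masking is applied.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    tau = np.broadcast_to(np.asarray(tau, dtype=float), x.shape)
    mask_id = vocab_size  # hypothetical convention: one extra id for MASK

    random_tokens = rng.integers(0, vocab_size, size=x.shape)
    hit = rng.random(x.shape) < tau  # positions selected for corruption

    if mode == "absorb":
        # Selected positions become MASK ("known unknowns").
        return np.where(hit, mask_id, x)
    if mode == "uniform":
        # Selected positions become random non-mask tokens ("unknown unknowns").
        return np.where(hit, random_tokens, x)
    if mode == "hybrid":
        # Mostly absorbing, plus a small chance that an unmasked token is wrong.
        swapped = rng.random(x.shape) < eps * tau
        out = np.where(swapped, random_tokens, x)
        return np.where(hit, mask_id, out)
    raise ValueError(f"unknown mode: {mode!r}")
```

In all three modes a noise level of 0 returns the input unchanged. In the hybrid mode, a denoiser still sees explicit masks but can no longer treat unmasked tokens as certain, which is the property that allows earlier decisions to be revised.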

Paper Structure

This paper contains 47 sections, 21 equations, 15 figures, 12 tables, 4 algorithms.

Figures (15)

  • Figure 1: Generative diffusion models prescribe (through $q_{t|t+1}$) a curriculum process $\{\mathbf{X}_t\}$, then learn (through $p_{t+1|t}^{\boldsymbol{\theta}}$) a reverse process $\{\hat{\mathbf{X}}_t\}$ so that the marginal distributions match at each step $t$ (vertical squiggly lines). $\mathbf{X}_T$ is the training dataset and $\hat{\mathbf{X}}_T$ is the generated output. This work focuses on discrete diffusion for sequences of discrete tokens. We show that standard autoregressive models (e.g., GPT) are an extreme case of this framework, a unification enabling a vast continuum of diffusion models, including autoregressive ones.
  • Figure 2: We introduce $\boldsymbol{\tau}$-hyperschedules (top row) -- subjecting different token positions $i$ to different noise levels (red high; blue low) at different generation steps $t$ -- and illustrate how three different noising processes (bottom 3 rows) can be modulated by such hyperschedules. Hyperschedules. (a) Standard AR models (e.g., GPT) determine tokens one by one, "quenching" each of them to full determination in a single step. They may thus be construed as an extreme case of a diffusion model. (b) Standard diffusion models (e.g., SEDD) gradually anneal all tokens independently of their position. (c) Block-wise application of flat annealing, here for blocks of width $\omega = 3$. (d) Annealing with a sliding window ("smoothed" AR), here using window width $\omega = 3$. These last two examples share important features of both AR and diffusion models. While the 4 presented examples all generate $\rho=1$ token per step in the long-sequence limit (with the caveat that the sliding window incurs an initial overhead of $\omega-1$ steps), the last 3 patterns are all straightforwardly adapted to $\rho > 1$ ("quick draft") and $\rho < 1$ ("think hard") regimes. Noising Processes. (2) The absorb noising process -- a.k.a. Masked Diffusion Model (MDM) -- overwrites tokens with a special $\text{MASK}$ token. These masks are "known unknowns": it is clear to a denoising model that it must put a non-mask token in their stead. Conversely, unmasked tokens are taken as "absolute truth" during generation: once a token has been unmasked, it remains unaltered until the end. (3) The uniform noising process overwrites with (non-mask) tokens selected uniformly at random from the vocabulary of possible tokens. The model thus has no direct way to know if a token has been altered before ("unknown unknowns"), and may thus revisit a position's value many times during generation. The orange/light-blue color coding is not part of the state $\mathbf{x}_t$, and is provided solely for the reader's convenience. (4) The hybrid noising process blends a little bit of uniform into the absorb process. When denoising, $\text{MASK}$ tokens still represent clear "known unknowns", while unmasked ones have become "a priori good" candidates that may however need to be "fixed". We argue that it is desirable for models to learn to fix their own mistakes.
  • Figure 3: Left: Generative perplexity as a function of token-level entropy. Right: Generative perplexity versus MAUVE score. Our models consistently outperform baselines, achieving lower perplexity at comparable levels of diversity and fluency.
  • Figure 4: Two transformer-based sequence generators for $d=4$. The Aligned configuration of standard diffusion models is reminiscent of masked language models. The Shifted configuration is closer to autoregressive language models. Here $\hat{x}^{-1}$ represents a token that is solely part of the conditioning (i.e., not generated), and may or may not be constant (e.g., BOS). Similarly, the output associated with the last token is discarded. Our position-based indexing abstracts away these details.
  • Figure 5: Example of attention mask for Aligned and Shifted configurations. Although these naive masks are appropriate for inference, directly training on them would be inefficient; see the later figures (labeled aligned-slide through shifted-block) for examples of training-ready masks.
  • ...and 10 more figures
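
The four hyperschedule patterns described in the Figure 2 caption can be sketched as a matrix of noise levels indexed by generation step and token position. This is a minimal sketch under assumptions that are not taken from the paper: the function name `hyperschedule`, the linear annealing profile, the convention that 1 means pure noise and 0 means fully determined, and the choice of one row per generation step with $\rho = 1$ are all illustrative.

```python
import numpy as np


def hyperschedule(kind: str, d: int, omega: int = 3) -> np.ndarray:
    """Return tau[t, i]: noise level of position i at generation step t.

    Illustrative conventions (assumptions, not the paper's definitions):
    1.0 is pure noise, 0.0 is fully determined, and annealing is linear.
    """
    i = np.arange(d)
    if kind == "ar":
        # (a) One token is "quenched" per step.
        steps, start, width = d, i, 1
    elif kind == "flat":
        # (b) All positions anneal together, regardless of position.
        steps, start, width = d, np.zeros(d, dtype=int), d
    elif kind == "block":
        # (c) Flat annealing applied block by block, blocks of width omega.
        n_blocks = (d + omega - 1) // omega
        steps, start, width = n_blocks * omega, (i // omega) * omega, omega
    elif kind == "slide":
        # (d) Sliding window: position i anneals over steps i .. i + omega.
        steps, start, width = d + omega - 1, i, omega
    else:
        raise ValueError(f"unknown kind: {kind!r}")

    t = np.arange(steps + 1)[:, None]  # column vector of generation steps
    # Each position stays at 1 until its start step, then decays linearly to 0.
    return np.clip(1.0 - (t - start[None, :]) / width, 0.0, 1.0)
```

With `d=6` and `omega=3`, `hyperschedule("slide", 6, 3)` has shape `(9, 6)` while `hyperschedule("ar", 6)` has shape `(7, 6)`: the sliding window needs $\omega - 1 = 2$ extra steps, matching the initial overhead mentioned in the caption. Under these conventions every pattern starts with all positions at noise level 1 and ends with all positions at 0; only the path between those two states differs.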