One-step Language Modeling via Continuous Denoising

Chanhyuk Lee; Jaehoon Yoo; Manan Agarwal; Sheel Shah; Jerry Huang; Aditi Raghunathan; Seunghoon Hong; Nicholas M. Boffi; Jinwoo Kim

TL;DR

This work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale.

Abstract

Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.
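The training recipe described in the abstract (noise one-hot token encodings in Euclidean space, then train a denoiser to predict the clean tokens with cross entropy) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the linear interpolant with `t = 0` as pure Gaussian noise and `t = 1` as clean data is an assumption, the time reparameterization is omitted, and `identity_denoiser` is a hypothetical stand-in for a trained network.

```python
import math
import random


def one_hot(token_id, vocab_size):
    """Encode a token id as a one-hot vector."""
    return [1.0 if v == token_id else 0.0 for v in range(vocab_size)]


def interpolate(clean, noise, t):
    """Assumed linear interpolant: t = 0 is pure noise, t = 1 is clean data."""
    return [(1.0 - t) * n + t * c for c, n in zip(clean, noise)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, token_id):
    """Negative log-probability assigned to the clean token."""
    return -math.log(softmax(logits)[token_id])


def noisy_sequence(token_ids, vocab_size, t, rng):
    """Noise every position of a sequence independently with Gaussian noise."""
    noisy = []
    for tok in token_ids:
        noise = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        noisy.append(interpolate(one_hot(tok, vocab_size), noise, t))
    return noisy


def denoising_loss(denoiser, token_ids, vocab_size, t, rng):
    """Mean cross entropy between denoiser logits and the clean tokens."""
    noisy = noisy_sequence(token_ids, vocab_size, t, rng)
    logits = denoiser(noisy, t)  # one logit vector per position
    losses = [cross_entropy(lg, tok) for lg, tok in zip(logits, token_ids)]
    return sum(losses) / len(losses)


def identity_denoiser(noisy, t):
    """Hypothetical stand-in for a network: reuse noisy coordinates as logits."""
    return noisy


rng = random.Random(0)
loss = denoising_loss(identity_denoiser, [2, 0, 1], vocab_size=4, t=0.9, rng=rng)
print(loss)
```

In a real model the denoiser would see the whole noisy sequence at once, so the prediction at each position can depend on every other position.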

Paper Structure (64 sections, 13 theorems, 76 equations, 17 figures, 6 tables)

This paper contains 64 sections, 13 theorems, 76 equations, 17 figures, 6 tables.

Introduction
Background & Related Work
Theoretical Framework
A continuous representation of language
Interpolants and flows for language modeling
Denoiser.
Relationship with discrete diffusion.
Flow maps for few-step language modeling
The two-time denoiser.
Flow maps in discrete diffusion.
Algorithmic Aspects
Flow-based language model ($\text{FLM}$)
Flow map language model ($\text{FMLM}$)
First stage.
Second stage.
...and 49 more sections

Key Result

Lemma 3.1

At each token position $l$, the optimal denoiser output equals the posterior probability over the vocabulary.
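The lemma can be checked by hand in a single-token toy setting. The sketch below assumes a linear interpolant `noisy = t * onehot + (1 - t) * eps` with standard Gaussian `eps` and a known prior over a three-word vocabulary; these choices, and the function names, are illustrative assumptions rather than the paper's notation. It computes the exact posterior by Bayes' rule and confirms that no other candidate distribution attains a lower expected cross entropy, which is why a cross-entropy-trained denoiser converges to the posterior.

```python
import math


def posterior_over_vocab(noisy, t, prior):
    """Exact posterior over the clean token for one position, with 0 <= t < 1.

    Assumes noisy = t * onehot(token) + (1 - t) * eps, eps ~ N(0, I).
    """
    sigma = 1.0 - t
    log_scores = []
    for v, p in enumerate(prior):
        # Squared distance between the noisy point and the scaled one-hot of v.
        sq_dist = sum(
            (x - (t if k == v else 0.0)) ** 2 for k, x in enumerate(noisy)
        )
        log_scores.append(math.log(p) - sq_dist / (2.0 * sigma ** 2))
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def expected_cross_entropy(q, posterior):
    """Expected cross entropy of prediction q when tokens follow the posterior."""
    return -sum(p * math.log(qv) for p, qv in zip(posterior, q))


prior = [0.5, 0.3, 0.2]
noisy = [0.1, 0.6, -0.2]
post = posterior_over_vocab(noisy, t=0.5, prior=prior)

# The posterior is never beaten by other candidate predictions.
best = expected_cross_entropy(post, post)
for candidate in ([1 / 3, 1 / 3, 1 / 3], prior):
    assert best <= expected_cross_entropy(candidate, post)
print(post)
```

At `t = 0` the noisy input carries no information and the posterior reduces to the prior; as `t` approaches 1 it concentrates on a single token.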

Figures (17)

Figure 1: Our flow map language model ($\text{FMLM}$) outperforms discrete diffusion models (gray) and matches the 8-step generation performance of few-step distilled discrete diffusion models (blue) in only one step, achieving an $\approx8.3\times$ speedup on LM1B.
Figure 2: Overview. Left: We leverage continuous interpolation between Gaussian noise and one-hot language encoding. Middle: Our flow-based language model ($\text{FLM}$) learns a denoiser that predicts clean data, which we convert into a flow for sampling. Right: Our distilled flow map language model ($\text{FMLM}$) directly transports states between distant timepoints, enabling few-step generation.
Figure 3: Factorization error in discrete diffusion. A toy dataset with two correlated modes, new-york and san-diego. Left: In many-step sampling, both continuous flow and discrete diffusion generate valid data. Right: In few-step sampling, the factorized transition of discrete diffusion yields a spurious mixture of all possible combinations (including the invalid pairings new-diego and san-york).
Figure 4: Decoding error rate over time across vocabulary sizes. Our time reparameterization $\tau(t)$ redistributes time so each step contributes uniformly to the denoising signal.
Figure 5: Generation performance of $\text{FLM}$ on LM1B (top) and OWT (bottom) compared to diffusion baselines.
...and 12 more figures
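The factorization error described in the Figure 3 caption can be reproduced with a few lines of arithmetic. The sketch below takes the two-mode toy dataset from the caption and models the extreme one-step case by sampling each position independently from its marginal; this simplified setup is an assumption for illustration, not the paper's experiment.

```python
from itertools import product

# Toy data distribution from the caption: two perfectly correlated modes.
joint = {("new", "york"): 0.5, ("san", "diego"): 0.5}


def marginals(joint):
    """Per-position marginal distributions of a joint over token tuples."""
    n_pos = len(next(iter(joint)))
    margs = [dict() for _ in range(n_pos)]
    for seq, p in joint.items():
        for pos, tok in enumerate(seq):
            margs[pos][tok] = margs[pos].get(tok, 0.0) + p
    return margs


def factorized(margs):
    """Distribution obtained by sampling every position independently."""
    out = {}
    for combo in product(*(m.items() for m in margs)):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        out[seq] = prob
    return out


one_step = factorized(marginals(joint))
invalid_mass = sum(p for seq, p in one_step.items() if seq not in joint)
print(one_step)      # all four pairings, each with probability 0.25
print(invalid_mass)  # 0.5: half the samples are new-diego or san-york
```

Each marginal is correct on its own, yet half of the probability mass lands on pairings that never occur in the data, because the independent product discards the correlation between positions.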

Theorems & Definitions (25)

Lemma 3.1
Proposition 3.2
Definition 2.1: Flow map
Proposition 2.2: Flow map characterizations
Corollary 2.3: Tangent condition
Proposition 2.4: Map distillation
Proposition 2.5: Self-distillation
Definition 3.1: Endpoint denoiser
Lemma 3.2: Denoiser-velocity relation
proof
...and 15 more
