CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think

Junzhe Shen, Jieru Zhao, Ziwei He, Zhouhan Lin

TL;DR

CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder) is a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens.

Abstract

We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token-recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder-temperature knob to navigate the fluency-diversity trade-off.
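
To make the decoder-temperature knob concrete, here is a minimal sketch of temperature-scaled sampling from next-token logits. It is an illustration only: the logits are made up, the helper names `temperature_softmax` and `entropy` are hypothetical, and the paper's actual decoder and sampling code are not shown in this summary.

```python
import math
import random

def temperature_softmax(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Made-up next-token logits over a 4-token vocabulary (assumption).
logits = [2.0, 1.0, 0.5, -1.0]

sharp = temperature_softmax(logits, 0.5)  # low temperature: peaked distribution
flat = temperature_softmax(logits, 1.5)   # high temperature: flatter distribution

# Draw one token index from the low-temperature distribution.
rng = random.Random(0)
token = rng.choices(range(len(logits)), weights=sharp, k=1)[0]
```

Lower temperatures concentrate probability on the highest-scoring tokens, which tends to favor fluency, while higher temperatures spread probability more evenly, which tends to favor diversity.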

Paper Structure

This paper contains 30 sections, 1 theorem, 34 equations, 2 figures, 6 tables.

Key Result

Proposition 1

Let $\mathcal{D}_{\mathrm{pw}}$ be the set of all decoders that factorize as $\prod_{i=1}^L q_i(y_i\mid X_i)$, and let $\mathcal{D}_{\mathrm{seq}}$ be the set of all conditional sequence decoders $q(y\mid X)$. Consider the expected negative log-likelihood (NLL) risk Then,
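
The risk and the conclusion of Proposition 1 are not reproduced in this listing, so the following is only a hedged numerical illustration of why pointwise decoders can be worse than sequence decoders under expected NLL. It relies on a standard fact: the lowest expected NLL of a decoder $q(y\mid X)$ is the conditional entropy $H(y\mid X)$, and for a factorized decoder $\prod_i q_i(y_i\mid X_i)$ it is $\sum_i H(y_i\mid X_i)$. The toy distribution (two tied tokens, each observed through a noisy binary "embedding") and all function names are assumptions made for this example, not the paper's setup.

```python
import itertools
import math

def toy_joint(eps):
    """Joint p(y1, y2, x1, x2): y1 = y2 = b ~ Bernoulli(0.5), and each x_i
    equals y_i flipped independently with probability eps."""
    joint = {}
    for b, x1, x2 in itertools.product((0, 1), repeat=3):
        p = 0.5
        p *= eps if x1 != b else 1 - eps
        p *= eps if x2 != b else 1 - eps
        joint[(b, b, x1, x2)] = p
    return joint

def conditional_entropy(joint, target_idx, given_idx):
    """H(target | given) in nats for a dict mapping outcomes to probabilities."""
    pair, marginal = {}, {}
    for outcome, p in joint.items():
        t = tuple(outcome[i] for i in target_idx)
        g = tuple(outcome[i] for i in given_idx)
        pair[(t, g)] = pair.get((t, g), 0.0) + p
        marginal[g] = marginal.get(g, 0.0) + p
    return -sum(p * math.log(p / marginal[g])
                for (t, g), p in pair.items() if p > 0)

eps = 0.2
joint = toy_joint(eps)

# Best pointwise decoder: each q_i sees only its own x_i.
risk_pw = (conditional_entropy(joint, (0,), (2,))
           + conditional_entropy(joint, (1,), (3,)))

# Best sequence decoder: q sees the whole sequence (x1, x2).
risk_seq = conditional_entropy(joint, (0, 1), (2, 3))

gap = risk_pw - risk_seq  # non-negative in this toy example
```

In this toy case the sequence decoder can exploit the fact that the two tokens are tied and can pool evidence from both positions, whereas each pointwise factor must decide from its own noisy observation alone.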

Figures (2)

  • Figure 1: Framework of CoDAR. Starting from a noisy latent sequence $\mathbf{x}_T$, a reverse diffusion process progressively denoises the hidden states to $\mathbf{x}_0$. Here $\mathbf{x}_T,\ldots, \mathbf{x}_0 \in \mathbb{R}^{L\times d}$, where $L$ denotes the sequence length and $d$ denotes the hidden-state size. An autoregressive Transformer decoder then conditions on the denoised $\mathbf{x}_0$ through cross-attention to translate $\mathbf{x}_0$ into discrete tokens $\mathbf{y}_1, \ldots, \mathbf{y}_L$.
  • Figure 2: Token recovery rate of a point-wise linear classifier and an autoregressive Transformer decoder for different hidden-state sizes.
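
As a rough illustration of the point-wise baseline named in Figure 2, the sketch below rounds each position of a denoised sequence independently with a linear scoring matrix and then measures how many tokens match a reference. The weight matrix, the vectors, and the definition of recovery rate as the fraction of matching positions are all assumptions made for this example; the paper's exact protocol is not given in this summary.

```python
def pointwise_round(x0, weight):
    """Assign each position the token whose weight row gives the highest
    dot-product score, ignoring all other positions."""
    tokens = []
    for vec in x0:
        scores = [sum(w * v for w, v in zip(row, vec)) for row in weight]
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens

def token_recovery_rate(predicted, reference):
    """Fraction of positions where the predicted token equals the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

# Hypothetical toy setup: 3-token vocabulary, hidden size d = 2, length L = 4.
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x0 = [[0.9, 0.1], [0.2, 0.8], [-0.7, -0.6], [0.4, 0.5]]
reference = [0, 1, 2, 0]

predicted = pointwise_round(x0, weight)
rate = token_recovery_rate(predicted, reference)
```

The last position is ambiguous on its own (its two leading scores are close), which is the kind of case where a decoder that also conditions on neighboring positions and previously emitted tokens could plausibly do better.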

Theorems & Definitions (2)

  • Proposition 1: Optimality gap of pointwise decoding
  • proof