CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think

Junzhe Shen; Jieru Zhao; Ziwei He; Zhouhan Lin

TL;DR

CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder) is a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens.

Abstract

We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. In a controlled token-recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on this insight, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder-temperature knob for navigating the fluency-diversity trade-off.
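The contrast between pointwise rounding and contextualized rounding can be illustrated with a minimal sketch. This is not the authors' implementation: the three-token vocabulary, the `EMBEDDINGS` table, and `toy_decoder_logits` (a hand-written stand-in for the autoregressive Transformer decoder) are hypothetical and exist only to show the control flow and the role of the decoder temperature.

```python
import math
import random

# Hypothetical 3-token vocabulary with 2-d embeddings (illustration only).
EMBEDDINGS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pointwise_round(x0):
    """Baseline: map each denoised vector independently to its nearest embedding."""
    vocab = range(len(EMBEDDINGS))
    return [min(vocab, key=lambda t: sq_dist(vec, EMBEDDINGS[t])) for vec in x0]

def toy_decoder_logits(x0, prefix, i):
    """Stand-in for a learned decoder (assumption, not the paper's model).

    Scores token candidates by closeness to x0[i] and adds a crude context
    term that penalizes repeating the previously emitted token.
    """
    logits = [-sq_dist(x0[i], emb) for emb in EMBEDDINGS]
    if prefix:
        logits[prefix[-1]] -= 0.5
    return logits

def contextual_round(x0, logits_fn=toy_decoder_logits, temperature=1.0, seed=0):
    """Autoregressive rounding: token i is sampled given x0 and tokens before i."""
    rng = random.Random(seed)
    tokens = []
    for i in range(len(x0)):
        probs = softmax(logits_fn(x0, tokens, i), temperature)
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens
```

In this sketch, lowering `temperature` pushes `contextual_round` toward greedy decoding, while raising it spreads probability over more tokens; this is the kind of fluency-diversity knob the abstract refers to.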

Paper Structure

This paper contains 30 sections, 1 theorem, 34 equations, 2 figures, 6 tables.

Introduction
Related Work
Continuous Diffusion Language Models
Hybrid Architectures
AR-Diffusion Hybrid
Continuous-Discrete Hybrid
Theoretical Analysis
Entropy and conditional total correlation.
Locality gap vs. dependence gap.
Proof sketch.
Interpretation and implications for rounding.
Why increasing $d$ helps but does not eliminate the issue.
Key observation.
Continuous Diffusion with Contextual AutoRegressive Decoder
Continuous Diffusion for Embedding Generation
...and 15 more sections

Key Result

Proposition 1

Let $\mathcal{D}_{\mathrm{pw}}$ be the set of all decoders that factorize as $\prod_{i=1}^L q_i(y_i\mid X_i)$, and let $\mathcal{D}_{\mathrm{seq}}$ be the set of all conditional sequence decoders $q(y\mid X)$. Consider the expected negative log-likelihood (NLL) risk $\mathbb{E}\left[-\log q(y\mid X)\right]$ of a decoder $q$. Then, …
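The displayed equations of the proposition are not preserved in this copy. As a hedged reconstruction, and not necessarily the authors' exact statement, the following standard information-theoretic identity is consistent with the proposition's title and with the section headings on conditional total correlation and on the locality and dependence gaps. The symbols $\mathcal{R}$, $p$, $X_{-i}$ (all latent positions except $i$), and $C(y\mid X)$ are notation assumed for this sketch.

```latex
\begin{align}
\mathcal{R}(q) &= \mathbb{E}_{(X,y)\sim p}\left[-\log q(y\mid X)\right], \\
\inf_{q\in\mathcal{D}_{\mathrm{seq}}} \mathcal{R}(q) &= H(y\mid X),
\qquad
\inf_{q\in\mathcal{D}_{\mathrm{pw}}} \mathcal{R}(q) = \sum_{i=1}^{L} H(y_i\mid X_i), \\
\inf_{q\in\mathcal{D}_{\mathrm{pw}}} \mathcal{R}(q) - \inf_{q\in\mathcal{D}_{\mathrm{seq}}} \mathcal{R}(q)
&= \underbrace{\sum_{i=1}^{L} I\left(y_i; X_{-i}\mid X_i\right)}_{\text{locality gap}}
 + \underbrace{C(y\mid X)}_{\text{dependence gap}} \;\ge\; 0, \\
C(y\mid X) &= \sum_{i=1}^{L} H(y_i\mid X) - H(y\mid X).
\end{align}
```

The third line follows by adding and subtracting $\sum_i H(y_i\mid X)$ and using $H(y_i\mid X_i) - H(y_i\mid X) = I(y_i; X_{-i}\mid X_i)$. Both terms are non-negative, so under this reading a pointwise decoder can match a sequence decoder only when each token depends on the latents through its own position alone and the tokens are conditionally independent given $X$.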

Figures (2)

Figure 1: Framework of CoDAR. Starting from a noisy latent sequence $\mathbf{x}_T$, a reverse diffusion process progressively denoises the hidden states to $\mathbf{x}_0$, with $\mathbf{x}_T,\ldots, \mathbf{x}_0 \in \mathbb{R}^{L\times d}$, where $L$ denotes the sequence length and $d$ denotes the hidden-state size. An autoregressive Transformer decoder then conditions on the denoised $\mathbf{x}_0$ through cross-attention to translate $\mathbf{x}_0$ into discrete tokens $\mathbf{y}_1, \ldots, \mathbf{y}_L$.
Figure 2: Token recovery rate of a pointwise linear classifier and an autoregressive Transformer decoder across hidden-state sizes.

Theorems & Definitions (2)

Proposition 1: Optimality gap of pointwise decoding
proof
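
An optimality gap of pointwise decoding can be checked numerically on a toy distribution. The sketch below is an illustration under stated assumptions, not the paper's experiment: it assumes the optimal NLL of a sequence decoder is $H(y\mid X)$ and that of a pointwise decoder is $\sum_i H(y_i\mid X_i)$, and it splits their difference into a locality term (information about $y_i$ held at other latent positions) and a dependence term (conditional total correlation). The joint distribution from `build_toy_joint` and all function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def build_toy_joint(flip=0.1):
    """Toy joint over outcomes (x1, x2, y1, y2) with L = 2 binary positions.

    y1 depends on both latent positions, and y2 depends on y1, so both a
    locality gap and a dependence gap are present by construction.
    """
    joint = defaultdict(float)
    for x1, x2, n1, n2 in product([0, 1], repeat=4):
        prob = 0.25 * (flip if n1 else 1 - flip) * (flip if n2 else 1 - flip)
        y1 = x1 ^ x2 ^ n1  # noisy XOR of the two latent bits
        y2 = y1 ^ n2       # noisy copy of the first token
        joint[(x1, x2, y1, y2)] += prob
    return dict(joint)

def cond_entropy(joint, target, given):
    """Conditional entropy H(target | given) in nats."""
    p_given = defaultdict(float)
    p_both = defaultdict(float)
    for outcome, prob in joint.items():
        g = given(outcome)
        p_given[g] += prob
        p_both[(target(outcome), g)] += prob
    return -sum(p * math.log(p / p_given[g])
                for (_, g), p in p_both.items() if p > 0)

def decompose_gap(joint, length=2):
    """Return the pointwise-vs-sequence NLL gap and its two components."""
    latents = lambda o: o[:length]
    tokens = lambda o: o[length:]
    h_seq = cond_entropy(joint, tokens, latents)  # H(y | X)
    h_pw = sum(cond_entropy(joint, lambda o, i=i: o[length + i],
                            lambda o, i=i: o[i])
               for i in range(length))            # sum_i H(y_i | X_i)
    h_tok_full = sum(cond_entropy(joint, lambda o, i=i: o[length + i], latents)
                     for i in range(length))      # sum_i H(y_i | X)
    return {
        "gap": h_pw - h_seq,
        "locality": h_pw - h_tok_full,
        "dependence": h_tok_full - h_seq,
    }
```

Calling `decompose_gap(build_toy_joint())` returns a strictly positive gap whose two components sum to the total, which is the qualitative behavior the proposition's title describes.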
