Locally Coherent Parallel Decoding in Diffusion Language Models

Michael Hersche; Nicolas Menet; Ronan Tanios; Abbas Rahimi

Abstract

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel block generation while ensuring sequential validity within each block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.
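The coherence problem described in the abstract can be illustrated with a minimal sketch. The two-token joint distribution below is a hypothetical toy example, not data from the paper, and the helper names are likewise illustrative. The sketch computes the per-position marginals exactly and shows how much probability mass independent sampling places on sequences the joint never produces.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy joint distribution over two-token sequences (an assumption
# for illustration only; not taken from the paper).
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint):
    """Return the per-position marginal distributions of a joint over token tuples."""
    length = len(next(iter(joint)))
    margs = [defaultdict(float) for _ in range(length)]
    for seq, p in joint.items():
        for pos, tok in enumerate(seq):
            margs[pos][tok] += p
    return [dict(m) for m in margs]


def product_of_marginals(margs):
    """Return the distribution obtained by sampling every position independently."""
    dist = {}
    for combo in product(*(m.items() for m in margs)):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        dist[seq] = prob
    return dist


independent = product_of_marginals(marginals(joint))

# Probability mass assigned to sequences that have zero probability under the joint,
# e.g. ("New", "Angeles") or ("Los", "York").
incoherent_mass = sum(p for seq, p in independent.items() if seq not in joint)
print(incoherent_mass)
```

In this toy case the marginals are perfectly calibrated, yet half of the independently sampled pairs mix tokens from incompatible sequences. This is the kind of joint dependency that token-wise independent sampling cannot capture.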

Paper Structure (48 sections, 5 theorems, 28 equations, 5 figures, 3 tables)

Introduction
Key Challenge: Good at Global Drafting, Bad at Local Coherence.
This work: Parallel Sampling via Local Coherence.
Preliminaries
Univariate Discrete Diffusion
Forward Process (Noising).
Reverse Process (Denoising).
Multivariate Discrete Diffusion
Forward Process (Noising).
Reverse Process (Denoising).
Evidence Lower Bound
Method
Local Coherence Reduces the NELBO
Modeling the Local Joint Probability with AR
Soft-Conditioning as a Sufficient DLM-AR Interface
...and 33 more sections

Key Result

Theorem 3.2

Consider discrete diffusion on random sequences $\mathbf x_0 = [b_0^1, b_0^2, \ldots, b_0^{L/B}]$ where $b_t^i \in \mathcal{W}$, and a denoising model $p_\theta$ adopting the block independence bias of Definition 3.1. Then the smallest possible NELBO is [equation omitted]. Further, suppose $b_t^i = [x_t^{(i-1)\cdot B+1}, \ldots, x_t^{i \cdot B}]$ are blocks of tokens $x_t^k \in \mathcal{V}$. Then [equation omitted].
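
As a concrete reading of the block indexing in the theorem statement, take a hypothetical sequence length $L = 8$ and block size $B = 4$ (values chosen here purely for illustration). The definition $b_t^i = [x_t^{(i-1)\cdot B+1}, \ldots, x_t^{i \cdot B}]$ then gives $L/B = 2$ blocks:

$$b_t^1 = [x_t^1, x_t^2, x_t^3, x_t^4], \qquad b_t^2 = [x_t^5, x_t^6, x_t^7, x_t^8], \qquad \mathbf x_t = [b_t^1, b_t^2].$$

Each block $b_t^i$ is thus a single variable taking values in $\mathcal{W}$ that groups $B$ consecutive tokens from $\mathcal{V}$.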

Figures (5)

Figure 1: Our CoDiLA in action. a) An example of incoherent text generated by Dream-Coder-Instruct-7B in the first iteration. Due to independent modeling of marginal distributions, it predicts the incoherent token "problem" (Top-1). b) This work enforces local coherence using a block-wise AR model conditioned on soft local tokens. In this example, it recovers coherence by retrieving the correct token "(list" from the Top-3 candidates. Displayed prompt was simplified for illustrative purposes.
Figure 2: CoDiLA with a block size of $B=4$. This example depicts the prediction of the first block ($b^1$). First, the DLM computes the token-wise conditional marginal probability vectors ($\boldsymbol{\pi}_t^j$). Next, we perform soft-conditioning by computing the expected embedding ($\mathbf{e}_t^j$) over the AR model's embedding matrix ($\mathbf{E}_\phi$), weighted by these marginals. Finally, the AR model receives these soft tokens, encapsulated by <think> and <$\backslash$think> boundary tokens, to autoregressively decode a locally coherent sequence.
Figure 3: Larger block sizes ($B$) reduce the training loss. We compute the average perplexity weighted by the masking ratio (see Equation \ref{eq:cross_entropy_upper_bound}), and display the moving average over 10 samples. The forward process always masks blocks of $8$ contiguous tokens.
Figure 4: Inference with static parallelism. We report Pass@1 (%) versus throughput (tokens/sec, batch size 1) on a single NVIDIA A100-80GB GPU. We compare the base DLM [xie_dreamcoder_2025], ADJUST [bansal_enabling_2025], and our CoDiLA, all built on Dream-Coder-Instruct-7B. Parallelism is controlled by unmasking a fixed number of tokens per iteration. CoDiLA consistently achieves higher accuracy at equivalent throughput levels.
Figure 5: Inference with dynamic parallelism. We operate a dynamic CoDiLA ($B=4$) with different entropy thresholds ($\tau$).
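
The soft-conditioning step described in the Figure 2 caption, where the expected embedding $\mathbf{e}_t^j$ is computed over the AR embedding matrix $\mathbf{E}_\phi$ weighted by the marginals $\boldsymbol{\pi}_t^j$, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `soft_embedding`, the three-token vocabulary, and the two-dimensional embeddings are all hypothetical.

```python
def soft_embedding(pi, E_phi):
    """Expected embedding e = sum_v pi[v] * E_phi[v].

    pi    : list of probabilities over the vocabulary (one position's marginal).
    E_phi : embedding matrix as a list of rows, one row per vocabulary entry.
    """
    if len(pi) != len(E_phi):
        raise ValueError("pi and E_phi must cover the same vocabulary")
    dim = len(E_phi[0])
    return [sum(p * row[d] for p, row in zip(pi, E_phi)) for d in range(dim)]


# Hypothetical toy embedding matrix: vocabulary of 3 tokens, embedding size 2.
E_phi = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]

# A soft token: the marginal spreads mass over several candidates.
soft = soft_embedding([0.5, 0.25, 0.25], E_phi)

# A one-hot marginal reduces to an ordinary (hard) embedding lookup.
hard = soft_embedding([0.0, 1.0, 0.0], E_phi)

print(soft, hard)
```

The one-hot case shows that hard token embeddings are a special case of this interface, while a spread-out marginal passes the DLM's uncertainty about competing candidates on to the AR model.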

Theorems & Definitions (11)

Definition 2.1: Conditional Token Independence
Definition 3.1: Conditional Block Independence
Theorem 3.2
Theorem 3.3
Remark 3.4
Theorem: Restatement of Theorem 3.2
proof
Theorem: Restatement of Theorem 3.3
proof
Proposition 2.1: Closed-Form KL-Divergence for Masked Diffusion according to [sahoo_simple_2024; shi_simplified_2024; gong_scaling_2025]
...and 1 more
