Locally Coherent Parallel Decoding in Diffusion Language Models

Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi

Abstract

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel block generation while ensuring sequential validity within each block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.
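
The coherence problem described in the abstract can be illustrated with a toy calculation. The sketch below is not taken from the paper: the two-token joint distribution and the helper names `marginals` and `product_of_marginals` are hypothetical, chosen only to show that sampling each position independently from its marginal can place probability mass on sequences the joint distribution never produces.

```python
from collections import defaultdict
from itertools import product

# Hypothetical joint distribution over two-token sequences (illustrative only).
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint_dist):
    """Compute the per-position marginal distribution of each token."""
    n_positions = len(next(iter(joint_dist)))
    out = [defaultdict(float) for _ in range(n_positions)]
    for seq, p in joint_dist.items():
        for i, tok in enumerate(seq):
            out[i][tok] += p
    return [dict(m) for m in out]


def product_of_marginals(margs):
    """Distribution obtained by sampling every position independently."""
    dist = {}
    for combo in product(*[m.items() for m in margs]):
        tokens = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, q in combo:
            prob *= q
        dist[tokens] = prob
    return dist


margs = marginals(joint)
indep = product_of_marginals(margs)

# Mass assigned to sequences that have zero probability under the joint.
incoherent_mass = sum(p for seq, p in indep.items() if seq not in joint)
print(indep)
print("incoherent mass:", incoherent_mass)
```

In this toy case, half of the probability mass under independent sampling falls on pairs such as ("New", "Angeles"), which is the kind of broken multi-token structure that local autoregressive decoding is meant to avoid.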

Paper Structure

This paper contains 48 sections, 5 theorems, 28 equations, 5 figures, and 3 tables.

Key Result

Theorem 3.2

Consider discrete diffusion on random sequences $\mathbf x_0 = [b_0^1, b_0^2, \ldots, b_0^{L/B}]$ where $b_t^i \in \mathcal{W}$, and a denoising model $p_\theta$ adopting the block independence bias of Definition 3.1. Then the smallest possible NELBO is [equation not preserved in this excerpt]. Further, suppose $b_t^i = [x_t^{(i-1)\cdot B+1}, \ldots, x_t^{i \cdot B}]$ are blocks of tokens $x_t^k \in \mathcal{V}$. Then [equation not preserved in this excerpt].

Figures (5)

  • Figure 1: Our CoDiLA in action. a) An example of incoherent text generated by Dream-Coder-Instruct-7B in the first iteration. Due to independent modeling of marginal distributions, it predicts the incoherent token "problem" (Top-1). b) This work enforces local coherence using a block-wise AR model conditioned on soft local tokens. In this example, it recovers coherence by retrieving the correct token "(list" from the Top-3 candidates. Displayed prompt was simplified for illustrative purposes.
  • Figure 2: CoDiLA with a block size of $B=4$. This example depicts the prediction of the first block ($b^1$). First, the DLM computes the token-wise conditional marginal probability vectors ($\boldsymbol{\pi}_t^j$). Next, we perform soft-conditioning by computing the expected embedding ($\mathbf{e}_t^j$) over the AR model's embedding matrix ($\mathbf{E}_\phi$), weighted by these marginals. Finally, the AR model receives these soft tokens, encapsulated by <think> and <$\backslash$think> boundary tokens, to autoregressively decode a locally coherent sequence.
  • Figure 3: Larger block sizes ($B$) reduce the training loss. We compute the average perplexity weighted by the masking ratio (see Equation \ref{eq:cross_entropy_upper_bound}), and display the moving average over 10 samples. The forward process always masks blocks of $8$ contiguous tokens.
  • Figure 4: Inference with static parallelism. We report Pass@1 (%) vs. throughput (tokens/sec, batch size 1) on a single NVIDIA A100-80GB GPU. We compare the base DLM [xie_dreamcoder_2025], ADJUST [bansal_enabling_2025], and our CoDiLA, all built on Dream-Coder-Instruct-7B. Parallelism is controlled by unmasking a fixed number of tokens per iteration. CoDiLA consistently achieves higher accuracy at equivalent throughput levels.
  • Figure 5: Inference with dynamic parallelism. We operate a dynamic CoDiLA ($B=4$) with different entropy thresholds ($\tau$).
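
The soft-conditioning step described in the Figure 2 caption (an expected embedding over the AR model's embedding matrix, weighted by the DLM marginals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the three-token vocabulary, and the two-dimensional embedding matrix `E_phi` are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List


def expected_embedding(pi: List[float], E: List[List[float]]) -> List[float]:
    """Return sum_v pi[v] * E[v], the marginal-weighted average embedding."""
    if len(pi) != len(E):
        raise ValueError("pi needs one probability per vocabulary row of E")
    dim = len(E[0])
    return [sum(p * row[d] for p, row in zip(pi, E)) for d in range(dim)]


def soft_tokens_for_block(
    marginals: List[List[float]], E: List[List[float]]
) -> List[List[float]]:
    """Map each position's marginal vector in a block to one soft token."""
    return [expected_embedding(pi, E) for pi in marginals]


# Hypothetical embedding matrix: 3 vocabulary entries, embedding dimension 2.
E_phi = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

# Hypothetical marginals for one block of size B = 4 (each row sums to 1).
block_marginals = [
    [1.0, 0.0, 0.0],    # confident position: reduces to a hard token embedding
    [0.5, 0.5, 0.0],    # uncertain between the first two tokens
    [0.0, 0.0, 1.0],
    [0.25, 0.25, 0.5],
]

soft = soft_tokens_for_block(block_marginals, E_phi)
for j, e in enumerate(soft, start=1):
    print(f"e^{j} =", e)
```

When a marginal is one-hot, the soft token coincides with the ordinary embedding of that token; otherwise it blends the embeddings of all candidates, which is what lets the auxiliary AR model see the DLM's uncertainty. Wrapping the resulting soft tokens in the boundary tokens and running the AR decoder is left out of this sketch.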

Theorems & Definitions (11)

  • Definition 2.1: Conditional Token Independence
  • Definition 3.1: Conditional Block Independence
  • Theorem 3.2
  • Theorem 3.3
  • Remark 3.4
  • Theorem: Restatement of Theorem 3.2
  • Proof
  • Theorem: Restatement of Theorem 3.3
  • Proof
  • Proposition 2.1: Closed-Form KL-Divergence for Masked Diffusion according to [sahoo_simple_2024, shi_simplified_2024, gong_scaling_2025]
  • ...and 1 more