Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models

Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Xinyu Fu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li

TL;DR

This work tackles unstable and inefficient inference in diffusion language models by introducing Coherent Contextual Decoding (CCD), which uses a trajectory-based, context-consistency measure grounded in conditional mutual information to rectify suboptimal decoding paths. It combines a marginalized-context target distribution with a sliding-window historical buffer to enable adaptive, per-step budgeting that favors coherent trajectories and accelerates sampling. Empirical results on Dream and LLaDA show substantial speedups (up to 3.48x) and performance gains across math, code, and planning benchmarks, with robust performance under varying hyperparameters. The approach integrates with existing inference optimizations and demonstrates the value of leveraging history-rich context for reliable, efficient diffusion-based generation.

Abstract

Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy, which lack a broader, trajectory-level perspective. This limitation frequently leads to inconsistent sampling trajectories and suboptimal generation quality. To address this, we propose Coherent Contextual Decoding (CCD), a novel inference framework built upon two core innovations. First, CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence, enabling the early rejection of suboptimal paths. We demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between context and token predictions. Second, building on this theoretical insight, we address the inefficiency of conventional uniform decoding budgets. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step according to our consistency metric. Consequently, our method significantly improves the quality of generation trajectories while accelerating the sampling process. Empirically, our method improves both inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to a 3.48x speedup alongside a 3.91% performance improvement.
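For readers unfamiliar with the quantity the abstract refers to, the textbook definition of conditional mutual information is given below. The symbols are assumptions made for illustration and are not taken from the paper: $X$ stands for a token prediction, $C$ for the context, and $Z$ for whatever the model additionally conditions on. The exact variables and conditioning used by CCD are not stated in this summary.

$$
I(X; C \mid Z) \;=\; \mathbb{E}_{p(x, c, z)}\left[\log \frac{p(x \mid c, z)}{p(x \mid z)}\right] \;=\; H(X \mid Z) - H(X \mid C, Z)
$$

The quantity is zero exactly when $X$ and $C$ are conditionally independent given $Z$, and it grows as the context $C$ carries more information about the prediction $X$ beyond what $Z$ already provides.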

Paper Structure

This paper contains 18 sections, 21 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Framework of our proposed method. We define a historical buffer $\mathcal{H}_t$ at iteration $t$ that stores the predictive distributions from the most recent $d$ iterations (excluding the current iteration $t$), keeping only the top-$V$ most confident tokens at each iteration. At the current iteration $t$, we also identify the mask token positions that appear both in the current top-$V$ set and in the historical buffer to obtain the current buffer $\mathcal{H}_{t}^{c}$. This ensures that consistently confident tokens with maximized effective context are used in the approximated target-distribution sampling procedure of the practical decoding equation.
  • Figure 2: Analysis of cumulative effective tokens (CET) and sampling budget using the Dream model on the Trip benchmark. The sequence length is set to 256. (a) We calculate the CET by excluding padding/EOS tokens from the total generated tokens. The decoding process includes periods where multiple EOS tokens are generated (visible as plateaus). (b) Sampling budget at each diffusion step for different sampling procedures. For our method, decoding stops early because all mask tokens have been decoded.
  • Figure 3: Hyperparameter analysis of buffer size and temperature coefficient using the Dream model. (a) The trade-off between score and computational steps as buffer size varies on the City=3 subset of the Trip benchmark. (b) Performance comparison across different temperature coefficients on the HumanEval benchmark.
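
The buffer bookkeeping described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HistoryBuffer`, `top_v_positions`, and `step` are hypothetical, confidence is assumed to be the maximum probability at a position, and the uniform average over stored distributions is only a placeholder for the paper's approximated target distribution, which this summary does not spell out.

```python
from collections import deque


def top_v_positions(probs, masked, v):
    """Return the v masked positions with the highest confidence.

    Assumption: confidence is the max probability at a position.
    Ties are broken by position index so the result is deterministic.
    """
    ranked = sorted(masked, key=lambda pos: (-max(probs[pos]), pos))
    return ranked[:v]


class HistoryBuffer:
    """Sliding window over the predictive distributions of the last d iterations."""

    def __init__(self, d):
        # deque(maxlen=d) silently drops the oldest iteration when full.
        self.window = deque(maxlen=d)

    def step(self, probs, masked, v):
        """Return the current buffer: positions in both the current top-V set
        and the history, mapped to an aggregated distribution."""
        top = top_v_positions(probs, masked, v)
        current = {}
        for pos in top:
            past = [entry[pos] for entry in self.window if pos in entry]
            if past:
                dists = past + [list(probs[pos])]
                # Placeholder aggregation (an assumption): uniform average
                # over the stored and current distributions.
                current[pos] = [sum(col) / len(dists) for col in zip(*dists)]
        # The current iteration joins the history only after the lookup,
        # so the buffer never contains the iteration being decoded.
        self.window.append({pos: list(probs[pos]) for pos in top})
        return current
```

Calling `step` once per diffusion iteration with the per-position distributions, the set of still-masked positions, and a top-V size returns the positions that were confident both now and recently; an empty result means no position has been consistently confident within the window.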

Theorems & Definitions (2)

  • proof
  • proof