From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models

Hengyu Fu, Baihe Huang, Virginia Adams, Charles Wang, Venkat Srinivasan, Jiantao Jiao

TL;DR

This work analyzes parallel decoding for Diffusion Language Models and reveals that prioritizing high-confidence tokens is information-theoretically inefficient, as the total sequence information must be revealed over multiple rounds. It proves a fundamental lower bound linking decoding rounds to the total information content and per-round information budget, motivating strategies to maximize information throughput. The authors introduce Explore-Then-Exploit (ETE), a training-free decoding framework that combines fast cross-block diffusion with principled exploration to target high-information tokens, triggering cascades that yield many confident predictions in fewer rounds. Empirically, ETE outperforms confidence-based baselines across four benchmarks, achieving substantial reductions in decoding rounds with maintained or improved generation quality, and identifies a regime where exploration incurs minimal overhead. The results offer a principled path toward closing the gap between parallel diffusion decoding and the joint distribution it aims to approximate.

Abstract

Diffusion Language Models (DLMs) have recently emerged as a strong alternative to autoregressive language models (LMs). DLMs offer comparable accuracy with faster inference speed via parallel decoding. However, standard DLM decoding strategies relying on high-confidence tokens encounter an inherent information-theoretic bottleneck that restricts decoding progress and ultimately slows generation. We demonstrate both theoretically and empirically that prioritizing high-confidence tokens is inherently inefficient. High-probability tokens carry negligible information and strictly relying on them limits the effective progress made in each decoding round. We prove that the number of decoding rounds must grow linearly with the sample's total information (negative log-likelihood) and inversely with the per-round information budget, establishing a bits-to-rounds principle. We also propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency. ETE combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape the conditional distribution and trigger cascades of confident predictions. Experiments verify our theoretical bounds and demonstrate that ETE consistently reduces the required number of decoding rounds compared to confidence-only baselines without compromising generation quality.
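The bits-to-rounds intuition can be illustrated with a short calculation. This is a minimal sketch, not the paper's theorem: it assumes the simplest reading of the abstract, in which rounds scale as total information divided by a per-round budget. The function names and all numbers are hypothetical and chosen only for illustration.

```python
import math


def surprisal_bits(p: float) -> float:
    """Information content, in bits, of a token predicted with probability p."""
    return -math.log2(p)


def illustrative_round_count(total_bits: float, bits_per_round: float) -> int:
    """Total information divided by the per-round budget, rounded up.

    Assumption: this plain ratio only mirrors the scaling described in the
    abstract; it is not the exact bound stated in Theorem 3.2.
    """
    return math.ceil(total_bits / bits_per_round)


# A highly confident token reveals almost nothing; an uncertain one reveals far more.
confident_bits = surprisal_bits(0.99)   # roughly 0.0145 bits
uncertain_bits = surprisal_bits(0.25)   # exactly 2 bits

# Hypothetical sample: 120 bits of negative log-likelihood, 4 bits revealed per round.
rounds = illustrative_round_count(total_bits=120.0, bits_per_round=4.0)
print(confident_bits, uncertain_bits, rounds)
```

Under these made-up numbers, doubling the per-round budget to 8 bits halves the count to 15, which is the sense in which unmasking only near-certain tokens (each worth a tiny fraction of a bit) makes little progress per round.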

Paper Structure

This paper contains 40 sections, 1 theorem, 21 equations, 6 figures, 3 algorithms.

Key Result

Theorem 3.2

Consider any parallel decoding schedule and any length-$n$ sequence $\mathbf{x}=(x^1,\ldots,x^n)$ satisfying Assumption asp:dynamic_threshold, and let the total approximation error $\epsilon$ be defined in Eq. eq:epsilon. Then the number of rounds $R$ must satisfy

Figures (6)

  • Figure 1: Side-by-side comparison of (implicit) exploration and exploitation, showing how rounds and bits are distributed for each factor $f$. For each factor, four bars show exploration rounds %, exploration bits %, exploitation rounds %, and exploitation bits %, with each proportion averaged over the same 200 samples from the GSM8K dataset. Colored boxes display efficiency ratios (Bits % / Rounds %) for exploration (red) and exploitation (blue).
  • Figure 2: Panel (a): Block diffusion unlocks the next block only after the current block is fully unmasked; Panel (b): Fast block diffusion (with budget 1) unlocks the next block once the budget is exhausted and retains the ability to unmask prior blocks, enabling faster decoding. In both panels, red arrows indicate unlocking the next block.
  • Figure 3: Panel (a): Exploitation rounds required versus partial log probability contributed. Each point represents a single sample from the GSM8K dataset, with colors indicating different choices of the factor $f$. We measure partial bits (excluding implicit exploration contributions) and exploitation rounds ($R - R_{\rm explore}$, excluding implicit exploration rounds) to isolate the effects of purely confidence-based parallel decoding. This decomposition removes the implicit exploration that occurs in confidence-based decoding and so better demonstrates our theory; Panel (b): Number of rounds and bits per exploitation step as a function of the factor $f$. Each point represents the mean over 200 fresh samples from the GSM8K dataset for a given factor, with 95% confidence intervals shown as shaded regions.
  • Figure 4: Average wall-clock time for batched forward passes with different beam sizes under KV cache on NVIDIA H100 and B200 GPUs. We adapted the Fast-DLLM codebase for the KV-cache implementation. To simulate realistic decoding scenarios, we generate sequences of length 512 with block length 64 (8 blocks total), using Fast-DLLM's confidence-based decoding algorithm. At the first 4 decoding steps of each block and for each beam size $k$, we perform an exploration round by selecting $k$ candidate tokens and creating $k$ hypothesis sequences through position substitution in the current block. We then perform a batched forward pass on the $k$ hypothesis sequences using the shared KV cache, repeating each batched forward pass 10 times to compute the average latency. This yields 32 exploration rounds in total (4 steps $\times$ 8 blocks), from which we compute the mean and 95% confidence interval across different exploration positions for each beam size.
  • Figure 5: Accuracy-steps frontiers of our method versus baseline on four benchmarks.
  • ...and 1 more figure
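The efficiency ratio in the Figure 1 caption (Bits % / Rounds %) can be sketched as follows. The helper name and the counts below are hypothetical, chosen only to show the arithmetic; they are not measurements from the paper.

```python
def efficiency_ratio(bits: float, total_bits: float,
                     rounds: int, total_rounds: int) -> float:
    """Share of bits contributed divided by share of rounds consumed."""
    bits_pct = 100.0 * bits / total_bits
    rounds_pct = 100.0 * rounds / total_rounds
    return bits_pct / rounds_pct


# Hypothetical decode: 100 bits revealed over 40 rounds in total.
# Exploration: 60 bits in 10 rounds. Exploitation: 40 bits in 30 rounds.
explore_ratio = efficiency_ratio(bits=60.0, total_bits=100.0, rounds=10, total_rounds=40)
exploit_ratio = efficiency_ratio(bits=40.0, total_bits=100.0, rounds=30, total_rounds=40)
print(explore_ratio, exploit_ratio)
```

A ratio above 1 means a phase reveals more than its proportional share of information per round; in this made-up example exploration sits at about 2.4 and exploitation at about 0.53.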

Theorems & Definitions (2)

  • Theorem 3.2: Step Lower Bound for Confidence-Based Parallel Decoding
  • proof