Accelerating Diffusion LLM Inference via Local Determinism Propagation

Fanheng Kong, Jingyuan Zhang, Yahui Liu, Zirui Wu, Yu Tian, Victoria W., Guorui Zhou

TL;DR

Diffusion LLMs enable parallel token decoding but suffer from delayed decoding due to conservative per-step commitments. The authors analyze decoding dynamics and derive two empirical principles: local determinism propagation around high-confidence anchors and spatial consistency decay. They propose LocalLeap, a training-free, anchor-guided, localized parallel decoding strategy that reduces decoding steps and boosts throughput with negligible quality loss. Across multiple benchmarks and two open-source dLLMs, LocalLeap achieves up to 6.94x throughput and reduces inference steps to about 14% of the original, demonstrating practical acceleration for diffusion-based text generation. The work provides a plug-and-play approach with theoretical backing and extensive ablations.

Abstract

Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations, a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94$\times$ throughput improvements and reduces decoding steps to just 14.2% of the original requirement, with negligible performance impact. The source code is available at: https://github.com/friedrichor/LocalLeap.

Paper Structure

This paper contains 23 sections, 46 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Confidence-aware sequential decoding visualization and analysis. (a) Sequential greedy decoding visualization on a GSM8K instance using LLaDA-8B-Instruct. Gray: inconsistent predictions; Green: consistent predictions (intensity indicates confidence from 0 to 1); Red: decoded tokens. (b) Distribution of earliest stable consistency steps, showing 63.7% of tokens achieve final consistency within 25% of total decoding steps. (c) Confidence-consistency relationship demonstrating that 97.8% of predictions with confidence > 0.8 remain consistent with final outputs.
  • Figure 2: Heatmap of consistency analysis with confidence surrounding decoded tokens. Centered on the decoded tokens at each step, we analyze the trends in confidence and consistency across surrounding positions. (a) When a decoded token exhibits high confidence ($c\geq0.9$), its nearby positions maintain high consistency even at moderately lower confidence levels, while more distant positions require correspondingly higher confidence thresholds. (b) When decoded tokens exhibit low confidence ($c<0.9$), consistency remains poor even at sub-high confidence levels ($c\in(0.8, 0.9)$), indicating that premature decoding at these confidence levels may introduce errors.
  • Figure 3: Illustration of our LocalLeap decoding mechanism. At each decoding step, we first compute confidence scores for all masked tokens through a forward pass, then identify anchors (tokens with confidence $c\geq0.9$). We expand outward by $W$ positions from each anchor, creating local neighborhoods where tokens can be decoded using a relaxed confidence threshold $\tau = 0.75$. This allows certain tokens to bypass redundant optimization steps, thereby reducing the total number of decoding iterations.
  • Figure 4: Ablation study on hyperparameters: anchor trigger boundary $\kappa$, neighbor radius $W$ and local relaxed boundary $\tau$ for LLaDA-Instruct on HumanEval. The default setting is $\{\kappa=0.9,W=4,\tau=0.75\}$, and we perform univariate analysis for each variable. The blue line indicates accuracy, while the orange line indicates throughput. Each dashed line represents baseline performance for metrics of the same color.
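
The selection rule described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The function name `select_positions`, its argument names, and the input format are hypothetical. Two behaviors are assumptions rather than facts from the captions: anchors are committed in the same step, and the step falls back to the single most confident masked token when no anchor exists. The defaults follow the Figure 4 setting $\{\kappa=0.9,W=4,\tau=0.75\}$.

```python
from typing import List, Sequence


def select_positions(
    confidence: Sequence[float],
    masked: Sequence[bool],
    kappa: float = 0.9,  # anchor trigger boundary (Figure 4 default)
    radius: int = 4,     # neighbor radius W (Figure 4 default)
    tau: float = 0.75,   # local relaxed boundary (Figure 4 default)
) -> List[int]:
    """Return the masked positions to commit in one decoding step."""
    masked_idx = [i for i, is_masked in enumerate(masked) if is_masked]
    if not masked_idx:
        return []

    # Anchors: still-masked positions whose confidence reaches kappa.
    anchors = [i for i in masked_idx if confidence[i] >= kappa]
    if not anchors:
        # Assumption: with no anchor, behave like greedy decoding and
        # commit only the single most confident masked position.
        return [max(masked_idx, key=lambda i: confidence[i])]

    # Assumption: the anchors themselves are committed in this step.
    chosen = set(anchors)
    last = len(confidence) - 1
    for anchor in anchors:
        lo = max(0, anchor - radius)
        hi = min(last, anchor + radius)
        # Inside the neighborhood, the relaxed threshold tau applies.
        for j in range(lo, hi + 1):
            if masked[j] and confidence[j] >= tau:
                chosen.add(j)
    return sorted(chosen)


if __name__ == "__main__":
    conf = [0.95, 0.80, 0.60, 0.78, 0.50, 0.76, 0.99, 0.70]
    print(select_positions(conf, [True] * 8, radius=2))  # [0, 1, 5, 6]
```

In the usage example, position 3 has confidence 0.78, which exceeds $\tau$, but it lies outside the radius-2 neighborhood of both anchors (positions 0 and 6), so it stays masked for a later step. The sketch applies one threshold across the whole neighborhood, as in the Figure 3 caption, and does not model the distance-dependent thresholds suggested by Figure 2.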