DAWN: Dependency-Aware Fast Inference for Diffusion LLMs

Lizhuo Luo, Zhuoran Shi, Jiajun Luo, Zhi Wang, Shen Ren, Wenya Wang, Tianwei Zhang

TL;DR

DAWN addresses the efficiency gap in diffusion LLM inference caused by inter-token dependencies. It introduces a training-free, dependency-aware decoding framework that constructs a sparse dependency graph from attention maps and uses two components, Anchor-Guided Decoding and Conflict-Based Scheduling, to select positions that are safe to update in parallel. Anchor-Guided Decoding relaxes the confidence threshold for induced positions that depend on already-unmasked anchors, while Conflict-Based Scheduling prevents strongly coupled low-confidence positions from being unmasked in the same step, enabling higher parallelism with minimal quality loss. Experiments across multiple models and datasets show speedups of 1.80–8.06× over baselines with comparable accuracy. The approach offers a training-free route to faster diffusion-based text generation.

Abstract

Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability for parallel decoding. However, constrained by the quality-speed trade-off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential underexplored. A core challenge is that parallel decoding assumes each position can be filled independently, but tokens are often semantically coupled: the correct choice at one position constrains the valid choices at others. Without modeling these inter-token dependencies, parallel strategies produce degraded outputs. Motivated by this insight, we propose DAWN, a training-free, dependency-aware decoding method for fast dLLM inference. DAWN extracts token dependencies and builds on two key observations: (1) positions that depend on already-unmasked, certain positions become more reliable, and (2) simultaneously unmasking strongly coupled uncertain positions induces errors. Building on these findings, DAWN uses a dependency graph to select more reliable unmasking positions at each iteration, achieving high parallelism with negligible loss in generation quality. Extensive experiments across multiple models and datasets demonstrate that DAWN speeds up inference by 1.80–8.06× over baselines while preserving generation quality. Code is released at https://github.com/lizhuo-luo/DAWN.

Paper Structure

This paper contains 20 sections, 5 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Attention Sinks in dLLMs. We conduct experiments on multiple samples using LLaDA-8B-Instruct. Left: The two heatmaps show partial attention maps from the same layer at different denoising steps, illustrating that the attention sink shifts across denoising iterations. Right: The third plot reports the distribution of attention scores corresponding to the first plot, and the rightmost plot reports the frequency of sink tokens from multiple samples.
  • Figure 2: Heatmap of Induced Consistency. We conduct experiments with LLaDA-8B-Instruct on sampled instances from GSM8K and HumanEval. For each request, we identify coupled pairs where anchors (prompts or previously unmasked tokens) influence induced positions (currently masked positions), and measure whether each induced token’s prediction matches the final decoded output (consistency ratio). Gray cells indicate bins with a negligible fraction of samples and are excluded from analysis.
  • Figure 3: Overview of DAWN. Left: Dependency Graph Construction preprocesses the attention map and extracts a sparse directed dependency graph by retaining only salient (high-score) attention links. Middle: guided by this graph, Anchor-Guided Decoding and Conflict-Based Scheduling select two sets of positions, and the union of selected positions is unmasked simultaneously.
  • Figure 4: Effectiveness of DAWN and the original sampler on HumanEval under different generation lengths ($L\in\{128,256,512,1024\}$). Bars report accuracy (left y-axis) and solid lines report TPS (right y-axis). Left and right figures correspond to LLaDA-8B-Instruct and Dream-v0-Instruct-7B.
  • Figure 5: Effectiveness of DAWN on HumanEval under different block lengths ($L\in\{8,16,32,64\}$). We report accuracy (blue, left y-axis) and TPS (red, right y-axis). Left and right figures correspond to LLaDA-8B-Instruct and Dream-v0-Instruct-7B.
  • ...and 5 more figures
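
The Dependency Graph Construction step named in the Figure 3 caption (preprocess the attention map, then keep only salient links) can be sketched as follows. The captions do not specify the preprocessing, so everything concrete here is an assumption: averaging over heads, zeroing out attention-sink columns (motivated by Figure 1), the function name `build_dependency_graph`, and the parameters `edge_threshold` and `sink_ratio`.

```python
import numpy as np

def build_dependency_graph(attn, edge_threshold=0.2, sink_ratio=3.0):
    """Extract a sparse directed dependency graph from an attention map.

    attn: (n, n) or (heads, n, n) array; attn[..., i, j] is the attention
          that query position i pays to key position j.
    Returns (graph, sinks): graph[i, j] is True if i depends on j, and
    sinks[j] is True if column j was treated as an attention sink.
    Both parameter defaults are hypothetical.
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim == 3:
        # Assumed preprocessing: average the attention over heads.
        attn = attn.mean(axis=0)

    # Assumed sink detection: a key position that receives far more
    # attention than the median position is treated as a sink.
    col_mass = attn.mean(axis=0)
    sinks = col_mass > sink_ratio * np.median(col_mass)

    scores = attn.copy()
    scores[:, sinks] = 0.0         # sinks carry little semantic dependency
    np.fill_diagonal(scores, 0.0)  # ignore self-attention links

    # Keep only salient (high-score) links as directed edges.
    return scores >= edge_threshold, sinks


attn = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.05, 0.30, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.70, 0.02, 0.03, 0.25],
])
graph, sinks = build_dependency_graph(attn)
print(np.argwhere(graph).tolist())  # -> [[1, 2], [2, 1]]
print(sinks.tolist())               # -> [True, False, False, False]
```

Because Figure 1 reports that the sink shifts across denoising iterations, a sketch like this would need to recompute `sinks` at every step rather than cache it.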