DAWN: Dependency-Aware Fast Inference for Diffusion LLMs
Lizhuo Luo, Zhuoran Shi, Jiajun Luo, Zhi Wang, Shen Ren, Wenya Wang, Tianwei Zhang
TL;DR
DAWN addresses the efficiency gap in diffusion LLM inference caused by inter-token dependencies. It is a training-free, dependency-aware decoding framework that builds a sparse dependency graph from attention maps and uses Anchor-Guided Decoding and Conflict-Based Scheduling to select positions that are safe to update in parallel. Anchor-Guided Decoding relaxes the confidence threshold for positions that depend on already-decoded anchor tokens, while Conflict-Based Scheduling keeps strongly coupled low-confidence positions from being updated in the same step. Together these enable higher parallelism with minimal quality loss. Experiments across multiple models and datasets show speedups of 1.80–8.06× over baseline methods at comparable accuracy, making DAWN a practical, training-free route to faster diffusion-based text generation.
Abstract
Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability to decode in parallel. However, constrained by the quality-speed trade-off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential unexplored. A core challenge is that parallel decoding assumes each position can be filled independently, whereas tokens are often semantically coupled: the correct choice at one position constrains the valid choices at others. Without modeling these inter-token dependencies, aggressive parallel strategies produce degraded outputs. Motivated by this insight, we propose DAWN, a training-free, dependency-aware decoding method for fast dLLM inference. DAWN extracts token dependencies and builds on two key observations: (1) positions that depend on already-unmasked, certain positions become more reliable, and (2) simultaneously unmasking strongly coupled uncertain positions induces errors. Based on these observations, DAWN uses a dependency graph to select more reliable positions to unmask at each iteration, achieving high parallelism with negligible loss in generation quality. Extensive experiments across multiple models and datasets demonstrate that DAWN speeds up inference by 1.80-8.06× over baselines while preserving generation quality. Code is released at https://github.com/lizhuo-luo/DAWN.
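
To make the two observations concrete, the following is a minimal sketch of one selection step. It is not the released implementation. The function name `select_positions`, the three thresholds, the rule "an edge exists when the attention weight is at least `tau_edge`", the choice to treat only already-unmasked positions as anchors, and the greedy conflict resolution are all assumptions made for illustration.

```python
import numpy as np


def select_positions(conf, attn, masked, tau_high=0.9, tau_relaxed=0.7, tau_edge=0.3):
    """Choose which masked positions to unmask in one iteration (illustrative sketch).

    conf   : (n,) per-position confidence of the current prediction
    attn   : (n, n) attention weights, attn[i, j] = how much position i attends to j
    masked : (n,) boolean, True where the position is still masked
    All thresholds are hypothetical values, not taken from the paper.
    """
    conf = np.asarray(conf, dtype=float)
    attn = np.asarray(attn, dtype=float)
    masked = np.asarray(masked, dtype=bool)

    # Assumed sparsification rule: keep edge i -> j only for strong attention.
    dep = attn >= tau_edge
    np.fill_diagonal(dep, False)

    # Positions that pass the ordinary high-confidence threshold.
    confident = masked & (conf >= tau_high)

    # Observation (1): a masked position that depends on an already-unmasked
    # position (an "anchor" in this sketch) may use a relaxed threshold.
    anchors = ~masked
    depends_on_anchor = (dep & anchors[None, :]).any(axis=1)
    induced = masked & ~confident & depends_on_anchor & (conf >= tau_relaxed)

    # Observation (2): do not unmask two strongly coupled uncertain positions
    # together. Greedily keep the more confident one of each conflicting pair.
    chosen_uncertain = []
    for i in sorted(np.flatnonzero(induced), key=lambda k: -conf[k]):
        if all(not (dep[i, j] or dep[j, i]) for j in chosen_uncertain):
            chosen_uncertain.append(i)

    result = sorted(int(i) for i in list(np.flatnonzero(confident)) + chosen_uncertain)

    # Guarantee progress: if nothing qualifies, unmask the single best position.
    if not result and masked.any():
        result = [int(np.argmax(np.where(masked, conf, -np.inf)))]
    return result


if __name__ == "__main__":
    conf = [1.0, 0.95, 0.75, 0.80, 0.40]
    masked = [False, True, True, True, True]
    attn = np.zeros((5, 5))
    attn[2, 0] = 0.5  # position 2 depends on the unmasked position 0
    attn[3, 0] = 0.4  # position 3 depends on the unmasked position 0
    attn[3, 2] = 0.6  # positions 2 and 3 are strongly coupled
    print(select_positions(conf, attn, masked))  # [1, 3]
```

In this toy input, position 1 passes the ordinary threshold, positions 2 and 3 both qualify under the relaxed threshold because they depend on the unmasked position 0, and position 2 is deferred to a later iteration because it is coupled with the more confident position 3.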
