FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion
Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta
TL;DR
This work tackles the latency bottlenecks of diffusion language models (DLMs) by introducing two training-free techniques: FreeCache, which reduces KV recomputation through a progressively frozen-window caching strategy, and Guided Diffusion, which pairs the DLM with a lightweight autoregressive guider to safely unmask tokens in parallel. Together, these methods yield substantial end-to-end speedups (average 12.14x overall, up to 34x in long-context tasks) with negligible accuracy degradation, enabling DLMs to match or exceed autoregressive latency on multiple benchmarks. The approach is validated on Dream-7B-Instruct and LLaDA-8B-Instruct across reasoning and QA tasks, showing strong cross-domain performance and robust long-context support without any finetuning. The work highlights a practical pathway for deploying diffusion-based reasoning systems at scale and in high-throughput settings.
Abstract
Diffusion language models (DLMs) offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling than autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized autoregressive (AR) models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational cost and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer significant quality drops as the number of denoising steps decreases. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, which uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver an average 12.14x end-to-end speedup across various tasks with negligible accuracy degradation. For the first time, diffusion language models achieve latency comparable to, and even lower than, that of widely adopted autoregressive models. Our work paves the way for scaling diffusion language models to a broader range of applications across different domains.
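To make the cost argument concrete, the following is a toy counting sketch, not the FreeCache algorithm itself. It assumes, purely for illustration, that the only saving comes from computing the prompt's KV projections once and reusing them, while every denoising step still recomputes projections for the whole generation region. The function names and the example sizes are hypothetical.

```python
def kv_projections_full_recompute(prompt_len: int, gen_len: int, steps: int) -> int:
    """Per-token KV projections when every denoising step re-encodes
    the full sequence (prompt + generation region)."""
    return steps * (prompt_len + gen_len)


def kv_projections_frozen_prompt(prompt_len: int, gen_len: int, steps: int) -> int:
    """Per-token KV projections under the simplifying assumption that
    prompt KV states are computed once and reused at every later step.
    Only the generation region is recomputed per step."""
    return prompt_len + steps * gen_len


if __name__ == "__main__":
    # Hypothetical sizes chosen only to illustrate the long-prompt case.
    prompt_len, gen_len, steps = 1024, 256, 256
    full = kv_projections_full_recompute(prompt_len, gen_len, steps)
    cached = kv_projections_frozen_prompt(prompt_len, gen_len, steps)
    print(f"full recompute: {full}")
    print(f"frozen prompt:  {cached}")
    print(f"ratio:          {full / cached:.2f}x")
```

Under this simplified model the saving grows with the prompt-to-generation ratio, which is consistent with the abstract's remark that the cost is most pronounced for long input prompts. Reducing the number of denoising steps, as Guided Diffusion aims to do, would lower `steps` in both counts.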
