Fast-dLLM v2: Efficient Block-Diffusion LLM

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie

TL;DR

Fast-dLLM v2 tackles autoregressive decoding latency by converting pretrained AR models into block diffusion decoders that generate text in blocks with intra-block diffusion and cross-block conditioning. It achieves data-efficient adaptation, requiring only about 1B fine-tuning tokens, and employs a hierarchical caching scheme (block-level cache and DualCache sub-block cache) to accelerate decoding. Extensive experiments on Qwen-2.5-Instruct models up to 7B show that Fast-dLLM v2 matches or surpasses AR baselines in accuracy while delivering state-of-the-art efficiency among diffusion-based LLMs, with up to 2.5× speedups. This work demonstrates a practical pathway to deploy fast, high-quality diffusion-based generation in real-world applications.

Abstract

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500× reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5× speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy while delivering state-of-the-art efficiency among dLLMs, marking a significant step toward the practical deployment of fast and accurate LLMs. Code and models will be publicly released.
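
To make the training-side ideas in the abstract more concrete, the following is a minimal sketch rather than the authors' implementation. It illustrates two pieces under simplifying assumptions: an attention pattern that is bidirectional inside a block and causal across blocks, and a pair of complementary token masks in which every position of a block is masked in exactly one of two views. The function names, the 0.5 masking probability, and the use of plain Python lists are all assumptions made for illustration; the actual training mask may be more elaborate.

```python
import random


def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    Assumed pattern: a position sees every position in its own block
    (bidirectional) and every position in earlier blocks (causal across
    blocks), but nothing in later blocks.
    """
    return [
        [j // block_size <= i // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


def complementary_masks(
    block_len: int, rng: random.Random
) -> tuple[list[bool], list[bool]]:
    """Return two masking patterns for one block; True means "masked".

    The second view is the element-wise complement of the first, so each
    position is masked (and therefore trained on) in exactly one view.
    The 0.5 masking probability is an illustrative assumption.
    """
    view_a = [rng.random() < 0.5 for _ in range(block_len)]
    view_b = [not masked for masked in view_a]
    return view_a, view_b


if __name__ == "__main__":
    # Print the attention pattern: "x" = may attend, "." = blocked.
    for row in block_attention_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))

    view_a, view_b = complementary_masks(block_len=4, rng=random.Random(0))
    print(view_a, view_b)
```

With `seq_len=6` and `block_size=2`, positions 0 and 1 see only columns 0 to 1, positions 2 and 3 see columns 0 to 3, and positions 4 and 5 see the whole sequence. Whatever the random draw, the two views never mask the same position and together cover the block.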

Paper Structure

This paper contains 34 sections, 5 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Performance comparison of Fast-dLLM v2. (a) Comparison of throughput and GSM8K accuracy among baseline models and the Fast-dLLM variants on an A100 GPU. Fast-dLLM v2 (7B) achieves 2.54× higher throughput than Qwen2.5-7B-Instruct while offering comparable accuracy. Additionally, it improves accuracy by +5.2% over Fast-dLLM-LLaDA, which is based on optimized LLaDA. (b) Throughput comparison under different batch sizes. Fast-dLLM v2 significantly outperforms all baselines at batch sizes 1 and 4, demonstrating superior scalability and efficiency.
  • Figure 2: Training process of Fast-dLLM v2. The input sequence is decoded block by block. Within each block, the model performs next-token prediction with partial masking. To ensure every token is trained, complementary masks are introduced so that masked tokens in one view can be predicted from the other. Loss is applied only to the predicted tokens highlighted in green, and dashed curves connect mask tokens to their corresponding predictions.
  • Figure 3: Illustration of the inference process. The sequence is decoded block by block. The decoded blocks are cached to speed up inference. Within each block, we adopt the parallel decoding and DualCache from Fast-dLLM to further accelerate inference.
  • Figure 4: Accuracy and throughput under different thresholds on GSM8K. A threshold of 0.9 is selected, offering a 2.6× speedup with minimal accuracy drop.
  • Figure 5: Throughput comparison between autoregressive and diffusion generation methods on NVIDIA A100 and H100 GPUs across varying batch sizes. Diffusion generation consistently outperforms autoregressive on both GPUs.
  • ...and 4 more figures
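
The captions for Figures 3 and 4 refer to parallel decoding within a block governed by a confidence threshold. The sketch below shows one way such a rule could work, under stated assumptions: at each refinement step every still-masked position whose top-1 probability reaches the threshold is finalized, and if none qualifies, the single most confident position is finalized so decoding always advances. The helper names (`select_positions_to_unmask`, `decode_block`, `toy_predict`) are hypothetical, `toy_predict` is a stand-in for a real model forward pass, and the block-level cache and DualCache are omitted entirely.

```python
from typing import Callable, Optional

# Hypothetical type: maps each masked position to (proposed token, confidence).
Proposals = dict[int, tuple[str, float]]


def select_positions_to_unmask(
    confidences: dict[int, float], threshold: float
) -> list[int]:
    """Choose which masked positions to finalize in this step."""
    if not confidences:
        return []
    chosen = [pos for pos, conf in confidences.items() if conf >= threshold]
    if not chosen:
        # Assumption: fall back to the most confident position so that
        # at least one token is finalized per step.
        chosen = [max(confidences, key=confidences.get)]
    return sorted(chosen)


def decode_block(
    block_len: int,
    predict: Callable[[list[Optional[str]]], Proposals],
    threshold: float = 0.9,
) -> tuple[list[Optional[str]], int]:
    """Iteratively fill one block; returns the tokens and the step count."""
    tokens: list[Optional[str]] = [None] * block_len  # None marks a masked slot
    steps = 0
    while any(tok is None for tok in tokens):
        proposals = predict(tokens)
        confidences = {pos: conf for pos, (_, conf) in proposals.items()}
        for pos in select_positions_to_unmask(confidences, threshold):
            tokens[pos] = proposals[pos][0]
        steps += 1
    return tokens, steps


def toy_predict(tokens: list[Optional[str]]) -> Proposals:
    """Toy stand-in for a model: even positions are always confident; odd
    positions become confident once their left neighbor is decoded."""
    proposals: Proposals = {}
    for pos, tok in enumerate(tokens):
        if tok is None:
            confident = pos % 2 == 0 or tokens[pos - 1] is not None
            proposals[pos] = (f"t{pos}", 0.95 if confident else 0.5)
    return proposals


if __name__ == "__main__":
    print(decode_block(4, toy_predict, threshold=0.9))
```

In this toy setup a block of four tokens is completed in two steps instead of four, which is the kind of saving a threshold trades against accuracy: a lower threshold finalizes more tokens per step but accepts less certain predictions.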