Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

Linye Wei, Wenjue Chen, Pingzhi Tang, Xiaotian Guo, Le Ye, Runsheng Wang, Meng Li

TL;DR

The paper tackles inefficiencies in diffusion-based LLM (dLLM) inference: bidirectional attention forces periodic KV-cache refreshes, so costly prefill passes are interleaved with decoding. It introduces ODB-dLLM, an arithmetic-intensity–aware framework that couples adaptive prefill length prediction with a dLLM-specific jump-share speculative decoding strategy to balance compute and memory workloads across dual boundaries. Experiments across multiple benchmarks report speedups of 46-162x over the baseline dLLM and 2.63-6.30x over Fast-dLLM, with improved or preserved accuracy. The work offers a practical, training-free method to accelerate diffusion language models on common GPUs by optimizing both prefill and decoding dynamics.

Abstract

Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance their inference efficiency by enabling KV caching. However, the bidirectional attention mechanism of dLLMs necessitates periodic cache refreshes that interleave prefill and decoding phases, both of which contribute substantial inference cost and constrain the achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual-boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which hurts efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method that improves efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162x and 2.63-6.30x speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation seen in existing acceleration frameworks.
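The "heterogeneous arithmetic intensity" the abstract refers to can be illustrated with a generic roofline calculation. The sketch below is not taken from the paper: the function names, the single linear layer used as a proxy workload, the 4096-wide layer, and the token counts are all hypothetical, and the hardware figures are the publicly listed A100 40GB values (312 TFLOPS dense FP16, 1555 GB/s), assumed here only for illustration.

```python
# Generic roofline sketch (illustrative assumptions, not the paper's code).

PEAK_FLOPS = 312e12      # assumed: A100 dense FP16 tensor-core peak, FLOP/s
MEM_BANDWIDTH = 1555e9   # assumed: A100 40GB HBM2 bandwidth, bytes/s


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH):
    """Intensity (FLOP/byte) at which a kernel moves from memory- to compute-bound."""
    return peak_flops / mem_bandwidth


def linear_layer_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOP/byte of one dense layer applied to `num_tokens` tokens.

    Counts a multiply-add as 2 FLOPs and assumes inputs, weights, and
    outputs each cross the memory bus exactly once.
    """
    flops = 2 * num_tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (
        num_tokens * d_in + d_in * d_out + num_tokens * d_out
    )
    return flops / bytes_moved


if __name__ == "__main__":
    ridge = ridge_point()
    # Hypothetical token counts: a long prefill pass vs. a short decoding block.
    for label, tokens in [("prefill-like", 1024), ("decode-like", 32)]:
        ai = linear_layer_intensity(tokens, 4096, 4096)
        regime = "compute-bound" if ai > ridge else "memory-bound"
        print(f"{label}: intensity={ai:.1f} FLOP/byte, {regime}")
```

Under these assumptions, the many-token pass lands above the ridge point and the few-token pass lands below it, which is the kind of gap between phases that motivates treating the prefill and decoding boundaries separately.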

Paper Structure

This paper contains 16 sections, 1 equation, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Inference with ODB-dLLM on the GSM8K dataset (Cobbe et al., 2021). Comparison of arithmetic intensity and speedup against a prior framework on the roofline model of an NVIDIA A100 GPU.
  • Figure 2: Inference with parallel decoding and DualCache.
  • Figure 3: Dual-boundary challenges for dLLM and the overview of ODB-dLLM's design.
  • Figure 4: Proportion of step counts and execution time across prefill and decoding phases.
  • Figure 5: (a) Distribution of the effective response lengths and (b) Analysis of next-step acceptance rates for tokens that were initially unaccepted (with top-2 confidence).
  • ...and 6 more figures