ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

Zijian Zhu, Fei Ren, Zhanhong Tan, Kaisheng Ma

TL;DR

This work proposes ES-dLLM, a training-free inference acceleration framework for dLLMs that reduces computation by skipping tokens in early layers based on their estimated importance. It achieves throughput of up to 226.57 and 308.51 tokens per second on LLaDA-8B and Dream-7B, respectively, on an NVIDIA H200 GPU.

Abstract

Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and their potential for parallel generation. Despite these advantages, dLLM inference remains computationally expensive because the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLMs that reduces computation by skipping tokens in early layers based on their estimated importance. Token importance is computed from the variation of intermediate tensors and the confidence scores of previous iterations. Experiments on LLaDA-8B and Dream-7B demonstrate that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6× to 16.8× speedup over the vanilla implementation and up to 1.85× over the state-of-the-art caching method, while preserving generation quality.
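
The importance-based skipping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract states only that importance combines intermediate tensor variation with confidence scores from previous iterations. The weighted-sum formula, the `alpha` weight, the `keep_ratio` parameter, and every function name below are hypothetical. The sketch also assumes, purely for illustration, that higher variation and higher confidence both raise a token's importance.

```python
import math

def l2_variation(prev_states, curr_states):
    """Per-token L2 distance between hidden states of two successive iterations."""
    return [
        math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_vec, curr_vec)))
        for prev_vec, curr_vec in zip(prev_states, curr_states)
    ]

def estimate_importance(variation, confidence, alpha=0.5):
    """Hypothetical score: weighted sum of normalized variation and confidence.

    The actual formula used by ES-dLLM is not given in the abstract.
    """
    v_max = max(variation) or 1.0  # avoid dividing by zero when nothing changed
    return [
        alpha * (v / v_max) + (1.0 - alpha) * c
        for v, c in zip(variation, confidence)
    ]

def select_tokens(importance, keep_ratio):
    """Return sorted indices of the tokens to recompute; the rest would be skipped."""
    k = max(1, math.ceil(keep_ratio * len(importance)))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])

# Toy example: 4 tokens with 2-dimensional hidden states (made-up values).
prev_states = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
curr_states = [[3.0, 4.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
confidence = [0.1, 0.9, 0.2, 0.3]

variation = l2_variation(prev_states, curr_states)
importance = estimate_importance(variation, confidence, alpha=0.5)
kept = select_tokens(importance, keep_ratio=0.5)
```

In this toy setup, tokens not listed in `kept` would reuse their cached intermediate states in the early layers instead of being recomputed, which is where the computational saving would come from.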

Paper Structure

This paper contains 30 sections, 2 equations, 8 figures, 15 tables, 1 algorithm.

Figures (8)

  • Figure 1: Confidence variation statistics using LLaDA-8B-Instruct. (a) uses a sample from the BBH dataset, while (b) and (c) present results using 100 samples from multiple datasets.
  • Figure 2: Hidden state variation in layer 10 using LLaDA-8B-Instruct. (a) uses a single sample from the BBH dataset, while (b) presents results using 100 samples from multiple datasets. The red vertical line in (a) separates prompt and output tokens, and the distribution in (b) includes only output tokens.
  • Figure 3: Illustration of ES-dLLM compared with the vanilla implementation and DualCache, assuming block-1 is under processing. The figure presents only 4 tokens per block, while the actual block length can be much larger (e.g., 32 or 64).
  • Figure 4: Ablation studies on importance estimation configurations using LLaDA-8B-Instruct.
  • Figure 5: Variation statistics of key, value, and query tensors in layer 10 using LLaDA-8B-Instruct. Left: single-sample heatmap from BBH; red line separates prompt and output tokens. Right: log-scale distribution for output tokens using 100 samples from multiple datasets.
  • ...and 3 more figures