DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

Younjoo Lee, Junghoo Lee, Seungkyun Dan, Jaiyoung Park, Jung Ho Ahn

TL;DR

DyLLM is a training-free inference framework that accelerates diffusion LLM decoding by recomputing only salient tokens between adjacent denoising steps. It achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models such as LLaDA and Dream.

Abstract

Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
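
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine_similarity`, `select_salient`, `partial_update`), the threshold parameter `tau`, the rule "salient when similarity falls below `tau`", and the stand-in `recompute` callback are all assumptions made for illustration, and plain Python lists stand in for per-token attention context vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_salient(prev_ctx, curr_ctx, tau):
    """Return indices of tokens whose attention context changed direction.

    Assumption: a token counts as salient when the similarity between its
    context at the previous and current denoising step is below tau.
    """
    return [
        i
        for i, (p, c) in enumerate(zip(prev_ctx, curr_ctx))
        if cosine_similarity(p, c) < tau
    ]


def partial_update(cached_out, curr_ctx, salient, recompute):
    """Recompute outputs for salient tokens only; reuse the cache for the rest."""
    out = list(cached_out)
    for i in salient:
        # Hypothetical stand-in for the real per-token layer computation.
        out[i] = recompute(curr_ctx[i])
    return out


if __name__ == "__main__":
    prev_ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    curr_ctx = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
    salient = select_salient(prev_ctx, curr_ctx, tau=0.99)
    cached = [[0.0, 0.0] for _ in curr_ctx]
    updated = partial_update(cached, curr_ctx, salient, lambda v: [2 * x for x in v])
    print(salient, updated)
```

Under this assumed rule, lowering `tau` marks fewer tokens as salient, so more cached activations are reused and less work is done per step.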

Paper Structure

This paper contains 42 sections, 2 theorems, 17 equations, 10 figures, 4 tables, 4 algorithms.

Key Result

Proposition 3.1

Let $W_o \in \mathbb{R}^{d \times d}$ be the output projection matrix, and $\alpha \in \mathbb{R}^+$ be a positive scaling factor. The composite operation of linear projection followed by RMSNorm satisfies:

Figures (10)

  • Figure 1: Runtime breakdown of autoregressive decoding with vLLM, the original diffusion LLM implementation, and DyLLM on random GSM8K 5-shot prompts ($\text{batch size}=16$). The original diffusion LLM implementation repeats the full computation at every step, whereas DyLLM recomputes only salient tokens, reducing the dominant per-step overhead.
  • Figure 2: Distribution of temporal cosine similarity $s_{t,l}$ across layers ($l\in\{8, 16, 24, 32\}$) obtained using GSM8K 5-shot prompts.
  • Figure 3: Approximate attention: exact attention with the full KV cache is computed only for salient tokens; for non-salient tokens, a column-sparse attention context operation approximates the updates.
  • Figure 4: Error of approximate attention relative to exact attention, measured by cosine similarity.
  • Figure 5: Accuracy for varying $\tau$ on the GSM8K and MBPP datasets. Accuracy generally declines as the threshold is lowered.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Proposition 3.1: Scale Invariance under Linear Projection
  • Proposition 3.2: Error Bound via Directional Alignment
  • proof
  • proof
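
The title of Proposition 3.1, "Scale Invariance under Linear Projection", suggests a property that can be sketched in a few lines. The derivation below is an illustration, not the paper's statement or proof: the input vector $x \in \mathbb{R}^d$, the gain vector $g$, and the choice of RMSNorm without an epsilon term are all assumptions introduced here. With $\mathrm{RMSNorm}(z) = g \odot z \,/\, \sqrt{\tfrac{1}{d}\lVert z \rVert_2^2}$ and $z = W_o x \neq 0$, for any $\alpha \in \mathbb{R}^+$:

$$
\mathrm{RMSNorm}\big(W_o(\alpha x)\big)
= g \odot \frac{\alpha\, W_o x}{\sqrt{\tfrac{1}{d}\lVert \alpha\, W_o x \rVert_2^2}}
= g \odot \frac{\alpha\, W_o x}{\alpha \sqrt{\tfrac{1}{d}\lVert W_o x \rVert_2^2}}
= \mathrm{RMSNorm}(W_o x).
$$

Under these assumptions only the direction of $x$ affects the normalized output, which is consistent with using cosine similarity, a direction-only measure, to compare attention contexts across steps.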