DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

Younjoo Lee; Junghoo Lee; Seungkyun Dan; Jaiyoung Park; Jung Ho Ahn

TL;DR

DyLLM is a training-free inference framework for diffusion LLMs that accelerates decoding by recomputing only salient tokens between adjacent denoising steps. It achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models such as LLaDA and Dream.

Abstract

Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
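The saliency test described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `select_salient_tokens`, the threshold value `tau=0.99`, and the array shapes are assumptions made for the example, and a token is treated as salient when the cosine similarity of its attention context across two adjacent steps falls below the threshold.

```python
import numpy as np

def select_salient_tokens(ctx_prev, ctx_curr, tau=0.99, eps=1e-8):
    """Return a boolean mask marking tokens whose attention context changed.

    ctx_prev, ctx_curr: arrays of shape (num_tokens, hidden_dim) holding the
    attention context of each token at two adjacent denoising steps.
    tau: hypothetical similarity threshold; tokens below it count as salient.
    """
    dot = np.sum(ctx_prev * ctx_curr, axis=-1)
    norms = np.linalg.norm(ctx_prev, axis=-1) * np.linalg.norm(ctx_curr, axis=-1)
    cos_sim = dot / (norms + eps)  # eps guards against zero-norm rows
    return cos_sim < tau

# Toy data: six tokens, of which only tokens 1 and 4 change between steps.
rng = np.random.default_rng(0)
ctx_prev = rng.standard_normal((6, 16))
ctx_curr = ctx_prev.copy()
ctx_curr[[1, 4]] = rng.standard_normal((2, 16))
salient = select_salient_tokens(ctx_prev, ctx_curr)
```

Under this reading, only the rows flagged in `salient` would be passed through the feed-forward and attention operations again, while the remaining rows would keep their cached activations.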

Paper Structure (42 sections, 2 theorems, 17 equations, 10 figures, 4 tables, 4 algorithms)

Introduction
Language Model Paradigms
Autoregressive Sampling: Sequential and One-Pass
Diffusion-based Sampling: Parallel and Multi-Pass
Semi-Autoregressive Decoding: Locally Parallel, Globally Sequential
The Efficiency Dilemma: Decoding Parallelism vs. Caching
DyLLM
Characterizing Temporal Sparsity
Salient Token Selection: Identifying Semantic Deltas
Propagation of Semantic Deltas Across Layers
Impact of Saliency-Aware Inference on Accuracy
Response-only Step
Evaluation
Experimental Setup
Main Results
...and 27 more sections

Key Result

Proposition 3.1

Let $W_o \in \mathbb{R}^{d \times d}$ be the output projection matrix, and $\alpha \in \mathbb{R}^+$ be a positive scaling factor. The composite operation of linear projection followed by RMSNorm satisfies:
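The identity itself does not appear on this page. Going only by the proposition's title (Scale Invariance under Linear Projection), one plausible reading is that RMSNorm applied after the projection is unaffected by scaling the input by $\alpha$, that is, $\mathrm{RMSNorm}(W_o(\alpha x)) = \mathrm{RMSNorm}(W_o x)$. That reading is an assumption, not the paper's statement. The numerical check below illustrates it with a hypothetical `rms_norm` helper that omits the learnable gain and uses no epsilon term; with a nonzero epsilon the equality would hold only approximately.

```python
import numpy as np

def rms_norm(x, eps=0.0):
    """RMSNorm without a learnable gain (simplifying assumption)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 8
W_o = rng.standard_normal((d, d))  # stand-in for the output projection matrix
x = rng.standard_normal(d)         # stand-in for an attention context vector
alpha = 3.7                        # any positive scaling factor

lhs = rms_norm(W_o @ (alpha * x))  # scale the input, project, normalize
rhs = rms_norm(W_o @ x)            # project and normalize the unscaled input
scale_invariant = np.allclose(lhs, rhs)
```

The check works because the projection is linear, so the factor $\alpha$ passes through $W_o$ and then cancels against the root-mean-square in the denominator.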

Figures (10)

Figure 1: Runtime breakdown of autoregressive decoding with vLLM, the original diffusion LLM implementation, and DyLLM on random GSM8K 5-shot prompts ($\text{batch size}=16$). The original diffusion LLM implementation reprocesses the full sequence at every step, whereas DyLLM recomputes only salient tokens, reducing the dominant per-step overhead.
Figure 2: Distribution of temporal cosine similarity $s_{t,l}$ across layers ($l\in\{8, 16, 24, 32\}$) obtained using GSM8K 5-shot prompts.
Figure 3: Approximate attention: exact attention with the full KV cache is computed only for salient tokens; for non-salient tokens, a column-sparse attention-context operation approximates the updates.
Figure 4: Error of approximate attention relative to exact attention, measured by cosine similarity.
Figure 5: Accuracy on the GSM8K and MBPP datasets as the threshold $\tau$ varies. Accuracy generally declines as the threshold is lowered.
...and 5 more figures
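The exact-attention half of the Figure 3 caption can be sketched as below. This is a simplified illustration under stated assumptions: the helper names `softmax` and `partial_attention` are hypothetical, attention is single-head and unmasked, and non-salient rows simply reuse a cached output rather than receiving the column-sparse approximate update the caption mentions, whose details are not given on this page.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def partial_attention(Q, K, V, salient, cached_out):
    """Recompute attention for salient query rows only.

    Salient rows attend over the full key/value set; all other rows keep
    their cached output (a simplification of the approximate update).
    """
    out = cached_out.copy()
    scores = Q[salient] @ K.T / np.sqrt(Q.shape[-1])
    out[salient] = softmax(scores) @ V
    return out

rng = np.random.default_rng(0)
n, d = 5, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
salient = np.array([True, False, False, True, False])
cached_out = np.zeros((n, d))  # placeholder for outputs cached at the previous step
out = partial_attention(Q, K, V, salient, cached_out)
full = softmax(Q @ K.T / np.sqrt(d)) @ V  # reference: full attention for every row
```

In this sketch the score matrix has one row per salient token instead of one per sequence position, which is where the per-step saving would come from when few tokens are salient.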

Theorems & Definitions (4)

Proposition 3.1: Scale Invariance under Linear Projection
Proposition 3.2: Error Bound via Directional Alignment
proof
proof
