
Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles

Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang

TL;DR

This work tackles the latency bottleneck of diffusion-based LLMs by introducing SlowFast Sampling, a dynamic two-stage decoding strategy guided by the Certainty, Convergence, and Positional principles. By alternating between exploratory and accelerated decoding and leveraging region-based caching, the method achieves large inference speedups while preserving generation quality. Experiments on LLaDA 8B and Dream 7B show substantial throughput gains, up to $34.22\times$ when combined with dLLM-Cache, and in some settings the method's throughput exceeds that of autoregressive LLaMA3 8B. The approach is compatible with caching and holds up across benchmarks, highlighting its practical potential for fast, high-quality generation with dLLMs.

Abstract

Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.

Paper Structure

This paper contains 13 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of the Three Golden Principles for Sampling in Diffusion-based LLMs. (a) Demonstrates the Convergence Principle: as decoding proceeds, the confidence values of tokens converge to stable values, either high (e.g., 12th token to 0.98) or low (e.g., 1st token to 0.25). (b) Visualizes the evolution of the Confidence Map over 256 diffusion steps for LLaDA Base on GSM8K. High-confidence tokens (in deep red) emerge progressively and are preferentially decoded (the Certainty Principle), while selection tends to cluster in contiguous regions (the Positional Principle), enabling cache reuse and efficient acceleration.
  • Figure 2: Throughput and Accuracy Comparison on GPQA (8-shot, Length=1024) with LLaDA and Our Proposed Methods. We evaluate LLaDA under three settings: (1) vanilla decoding, (2) with our proposed SlowFast Sampling, and (3) SlowFast Sampling further enhanced by dLLM-Cache. Compared to the vanilla setting, SlowFast Sampling alone achieves a 15.63$\times$ speedup while maintaining comparable accuracy. With dLLM-Cache, throughput improves further to 54.75 tokens/sec (up to 34.22$\times$ speedup), with only minor drops in accuracy. This demonstrates the strong efficiency gains and flexibility enabled by our dynamic strategy.
  • Figure 3: Overview of the SlowFast Sampling Pipeline: From Exploratory to Accelerated Decoding. The method alternates between a Slow (Exploratory) stage and a Fast (Accelerated) stage for efficient token generation. In the Slow stage (left), the model decodes cautiously, selecting the top-$k$ high-confidence tokens per step while continuously predicting the End Point of Convergence and calculating confidence variance across a history window. Once the variance drops below the threshold (e.g., $0.22 < 0.23$), the corresponding region $[s_{cycle}, e_{cycle}]$ is considered stable. In the Fast stage (right), this stable span is decoded in parallel with aggressive unmasking of high-confidence tokens, while tokens beyond the span are temporarily skipped and their results cached for reuse. This alternating structure reduces redundant computation and accelerates decoding while maintaining output quality.
  • Figure 4: Effect of Confidence Thresholds. $\tau_{min\_conf}$ controls the exploratory range, and $\tau_{high\_conf}$ balances accuracy and speed during fast decoding. Other hyperparameters follow the default settings in the main results section.
  • Figure 5: Sensitivity study of the stability-check hyperparameters. Accuracy and TPS vary with $K_{max}$, $\sigma^2_{stable}$, and $W_{hist}$. The chosen defaults ($K_{max}=8$, $\sigma^2_{stable}=1.0$, $W_{hist}=2$) offer strong speed-quality trade-offs.
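
The slow/fast alternation described in the Figure 3 caption can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the rule used to predict the end point, and the choice to take the variance over predicted end points (rather than over raw confidences) are all assumptions made for illustration, and the per-step top-$k$ selection and caching are omitted. The parameter names `tau_min_conf`, `tau_high_conf`, `w_hist`, and `var_stable` mirror $\tau_{min\_conf}$, $\tau_{high\_conf}$, $W_{hist}$, and $\sigma^2_{stable}$ from the captions, and the confidence values are made up.

```python
from statistics import pvariance


def predict_end_point(conf, start, tau_min_conf):
    """Hypothetical rule: extend the span from `start` while confidence stays >= tau_min_conf."""
    end = start
    for pos in range(start, len(conf)):
        if conf[pos] < tau_min_conf:
            break
        end = pos
    return end


def is_stable(end_points, w_hist, var_stable):
    """True once the last `w_hist` predicted end points have variance below `var_stable`."""
    if len(end_points) < w_hist:
        return False
    return pvariance(end_points[-w_hist:]) < var_stable


def fast_unmask(conf, start, end, tau_high_conf):
    """Positions in [start, end] confident enough to unmask in parallel (fast stage)."""
    return [p for p in range(start, end + 1) if conf[p] >= tau_high_conf]


# Made-up per-step confidences for a 6-token sequence (one list per slow step).
steps = [
    [0.90, 0.60, 0.20, 0.10, 0.10, 0.10],
    [0.90, 0.80, 0.70, 0.50, 0.20, 0.10],
    [0.95, 0.90, 0.80, 0.60, 0.20, 0.10],
]

# Slow stage: keep predicting the end point until the prediction stops moving.
history = []
for conf in steps:
    history.append(predict_end_point(conf, start=0, tau_min_conf=0.4))
    if is_stable(history, w_hist=2, var_stable=1.0):
        break

# Fast stage: decode the stable span [s_cycle, e_cycle] in parallel.
span = (0, history[-1])
ready = fast_unmask(steps[len(history) - 1], span[0], span[1], tau_high_conf=0.85)

print(history, span, ready)
```

In this toy run the predicted end point moves from 1 to 3 and then stays at 3, so the span `(0, 3)` is declared stable on the third step and positions 0 and 1 clear the high-confidence threshold. A smaller `var_stable` or larger `w_hist` would keep the sampler in the slow stage longer, which is the speed-quality trade-off Figure 5 examines.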