DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

TL;DR

The paper addresses the suboptimality of fixed-block schedules in diffusion LLMs by introducing Dynamic Sliding Block (DSB), a training-free method that adapts the active decoding window to semantic difficulty. Paired with DSB Cache, a KV-cache design tailored to sliding blocks, the approach aims to keep decoding globally causal while remaining locally parallel, reducing premature low-confidence commitments and stale cache states. Experiments across multiple models and benchmarks show consistent improvements in both generation quality and inference speed. The work offers a practical, model-agnostic scheduling framework, with possible extensions to training-time integration and broader inference optimizations.

Abstract

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and show the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.
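
The two failure modes of the naive schedule named in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper or its released code: the function names, the confidence threshold, and the fallback of committing the single most confident position when nothing in the block clears the threshold are all assumptions made for illustration.

```python
from typing import List, Optional

# None marks a position that is already decoded; a float is the model's
# confidence for a still-masked position (toy values, not real model output).
Confidence = List[Optional[float]]


def naive_block_step(conf: Confidence, block_start: int, block_len: int,
                     threshold: float) -> List[int]:
    """Positions committed in one denoising step under a fixed block."""
    block_end = min(block_start + block_len, len(conf))
    masked = [i for i in range(block_start, block_end) if conf[i] is not None]
    confident = [i for i in masked if conf[i] >= threshold]
    if confident or not masked:
        return confident
    # Assumed fallback: to guarantee progress, commit the most confident
    # masked position in the block even though it is below the threshold.
    return [max(masked, key=lambda i: conf[i])]


def delayed_positions(conf: Confidence, block_start: int, block_len: int,
                      threshold: float) -> List[int]:
    """High-confidence masked positions that the fixed block cannot decode yet."""
    block_end = block_start + block_len
    return [i for i, c in enumerate(conf)
            if c is not None and c >= threshold
            and not (block_start <= i < block_end)]


conf = [None, None, 0.30, 0.20, 0.95, 0.90, 0.10, 0.40]
forced = naive_block_step(conf, block_start=2, block_len=2, threshold=0.8)
delayed = delayed_positions(conf, block_start=2, block_len=2, threshold=0.8)
print(forced, delayed)  # [2] [4, 5]
```

Here position 2 is committed at confidence 0.30 (a premature commitment), while positions 4 and 5 sit just past the block boundary with confidence of at least 0.90 and have to wait.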

Paper Structure

This paper contains 16 sections, 2 equations, 8 figures, 2 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Limitation of naive block scheduling. At the denoising step T, several positions outside the active block have high confidence (dark yellow) but cannot be decoded due to the fixed block constraint, delaying easy tokens near the boundary. At later steps (T+N), the method is forced to decode low-confidence positions inside the active block, which can lead to premature, incorrect commitments (e.g., decoding “six” instead of the ground-truth “zero”).
  • Figure 2: A brief teaser of Dynamic Sliding Block (DSB). At the denoising step T, the top row shows the confidence of masked positions (darker yellow indicates higher confidence). Global decodes the whole response simultaneously. Naive Block uses a fixed block, which can force early decoding of low-confidence positions inside the block and delay high-confidence positions outside it. In contrast, DSB employs a sliding block with dynamic size (red dashed box) to mitigate both issues.
  • Figure 3: Overview of DSB with DSB Cache. DSB maintains an active block (red) that slides and can change its size across denoising steps, enabling globally causal yet locally parallel decoding. DSB Cache caches KV states for positions outside the active block (shaded region), while jointly refreshing the active block and the immediately preceding prefix window (blue) to handle boundary instability introduced by block movement. A periodic global cache refresh performs full computation to re-synchronize cached states.
  • Figure 4: Ablation results of block length under parallel decoding. We compare DSB and naive block scheduling across different initial block lengths. Bars denote accuracy (left y-axis) and solid lines denote TPS (right y-axis). Results are reported on GSM8K with LLaDA-8B-Instruct.
  • Figure 5: Ablation results of generation length under parallel decoding. We compare DSB and the vanilla sampler across different generation lengths. Bars denote accuracy (left y-axis) and solid lines denote TPS (right y-axis). Results are reported on HumanEval, with the left and right panels corresponding to LLaDA-8B-Instruct and Dream-v0-Instruct-7B, respectively.
  • ...and 3 more figures
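
The cache behavior described in the Figure 3 caption can be sketched as a rule for which positions have their KV states recomputed at a given step. This is only a sketch under stated assumptions: `positions_to_recompute`, `prefix_window`, and `refresh_interval` are hypothetical names, the refresh is assumed to fire whenever the step index is a multiple of the interval, and the rule DSB uses to move and resize the active block is not given in this summary, so it is not modeled here.

```python
from typing import List


def positions_to_recompute(seq_len: int, block_start: int, block_end: int,
                           prefix_window: int, step: int,
                           refresh_interval: int) -> List[int]:
    """Indices whose KV states are recomputed at this denoising step.

    The active block is the half-open range [block_start, block_end).
    All other positions are assumed to reuse their cached KV states.
    """
    if refresh_interval <= 0:
        raise ValueError("refresh_interval must be positive")
    # Periodic global refresh: full computation to re-synchronize the cache.
    if step % refresh_interval == 0:
        return list(range(seq_len))
    # Otherwise refresh the active block together with the prefix window
    # immediately before it, clipped to the sequence bounds.
    start = max(0, block_start - prefix_window)
    return list(range(start, min(block_end, seq_len)))


partial = positions_to_recompute(16, block_start=8, block_end=12,
                                 prefix_window=2, step=3, refresh_interval=4)
full = positions_to_recompute(16, block_start=8, block_end=12,
                              prefix_window=2, step=4, refresh_interval=4)
print(partial)    # [6, 7, 8, 9, 10, 11]
print(len(full))  # 16
```

In this toy setting a regular step recomputes 6 of 16 positions, and every fourth step recomputes all of them. The caption motivates the prefix window as a way to handle boundary instability caused by block movement.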