AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

Guanxi Lu, Hao Mark Chen, Yuto Karashima, Zhican Wang, Daichi Fujiki, Hongxiang Fan

TL;DR

This work tackles two inefficiencies of fixed-block semi-autoregressive decoding in diffusion LLMs: late decoding overhead and premature decoding error. It introduces AdaBlock-dLLM, a training-free, semantic-aware scheduler that adapts the block size $B$ at runtime, aligning block boundaries with semantic steps using delimiter signals, confidence dynamics, and a volatility band. Across multiple dLLMs and benchmarks, AdaBlock-dLLM improves accuracy by up to $5.3\%$ under the same throughput budget, with pronounced gains when KV caching is used, while maintaining competitive throughput. The results highlight the value of semantics-aware scheduling for diffusion-based generation and point to potential future training objectives that preserve context more effectively during decoding.

Abstract

Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, blockwise semi-autoregressive (semi-AR) approaches are widely adopted due to their natural support for KV caching and their favorable accuracy-speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed, and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs.
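
The idea of aligning block boundaries with semantic steps can be illustrated with a small sketch. This is not the authors' algorithm: the delimiter set, the threshold value, the fallback rule, and the function name `choose_block_size` are all assumptions made for illustration. The sketch takes the model's current top predictions and confidences for the still-masked positions after the decoded prefix, and ends the next block at the most confident predicted delimiter, falling back to a default block size when no delimiter is confident enough.

```python
from typing import FrozenSet, Sequence

# Hypothetical delimiter set; the summary above does not list the actual tokens.
DEFAULT_DELIMITERS: FrozenSet[str] = frozenset({"\n", ".", ",", "?", "!"})


def choose_block_size(
    tokens: Sequence[str],
    confidences: Sequence[float],
    default_size: int,
    delimiter_threshold: float = 0.3,  # assumed value, not taken from the paper
    delimiters: FrozenSet[str] = DEFAULT_DELIMITERS,
) -> int:
    """Pick the size of the next semi-AR block.

    tokens[i] and confidences[i] are the current top prediction and its
    confidence for the i-th masked position after the decoded prefix.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    if not tokens:
        return 0

    # Find the predicted delimiter with the highest confidence.
    best_pos, best_conf = -1, 0.0
    for pos, (tok, conf) in enumerate(zip(tokens, confidences)):
        if tok in delimiters and conf > best_conf:
            best_pos, best_conf = pos, conf

    # End the block at that delimiter if the model is confident enough about it.
    if best_pos >= 0 and best_conf >= delimiter_threshold:
        return best_pos + 1

    # Otherwise fall back to the fixed default, clipped to the remaining length.
    return min(default_size, len(tokens))


if __name__ == "__main__":
    toks = ["x", "=", "4", "\n", "y", "=", "?", "z"]
    confs = [0.9, 0.95, 0.8, 0.7, 0.2, 0.3, 0.1, 0.05]
    print(choose_block_size(toks, confs, default_size=8))
```

In this toy input the confident newline at position 3 closes the block, so the low-confidence positions after it are deferred to a later block instead of being committed early. A real scheduler would operate on token IDs and on confidences from the denoising step.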

Paper Structure

This paper contains 40 sections, 5 equations, 11 figures, 2 tables, 4 algorithms.

Figures (11)

  • Figure 1: Illustrative examples of two key issues (left) and how they can be overcome with AdaBlock-dLLM (right). A real case study is provided in the appendix.
  • Figure 2: Performance improvement over Fast-dLLM (Wu et al., 2025).
  • Figure 3: Confidence scores across sequence positions for LLaDA-8B-Base, evaluated on 100 samples from the GSM8K benchmark. The high confidence plateau expands as decoding progresses, while positions beyond the decoded prefix exhibit high variance.
  • Figure 4: Illustration of the high confidence plateau, the volatility band (VB), and the low confidence floor across three samples. Within VB, the distribution of confidence scores and the width of the band vary across samples.
  • Figure 5: Proportion of sampling steps affected by late decoding overhead and premature decoding error on GSM8K and HumanEval for fixed block sizes.
  • ...and 6 more figures