Advancing Block Diffusion Language Models for Test-Time Scaling

Yi Lu; Deyang Kong; Jianing Wang; Linsen Guo; Xue Wang; Qi Guo; Tao Gui; Xuanjing Huang; Wei Ye; Shikun Zhang; Wei Wang

TL;DR

This work tackles the challenge of test-time scaling in Block Diffusion Language Models (BDLMs) by introducing two complementary innovations: Bounded Adaptive Confidence Decoding (BACD), which adaptively controls denoising based on running confidence bounds, and Think Coarse, Critic Fine (TCCF), a two-stage decoding strategy that allocates large blocks for exploration and small blocks for refinement. It further strengthens BDLM practicality with Progressive Block Size Extension to safely scale block sizes during training. Empirical results across mathematics, code, and STEM benchmarks demonstrate substantial speedups (e.g., $1.71\times$ with BACD) and accuracy gains (e.g., up to $+11.7\%$ with TCCF) over strong baselines, with demonstrations of generalization to other BDLMs. Collectively, the approach provides a concrete path toward efficient, reliable long-chain reasoning with diffusion-based, block-wise generation in real-world settings.

Abstract

Recent advances in block diffusion language models (BDLMs) have demonstrated competitive performance and strong scalability on reasoning tasks. However, BDLMs remain underexplored in the test-time scaling setting, and they face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing decoding speed against effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, balancing efficiency and effectiveness. To keep decoding efficient and effective at large block sizes, we adopt Progressive Block Size Extension, which mitigates the performance degradation that occurs when block sizes are scaled up. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B ($2.26\times$ speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
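
The abstract describes BACD only at a high level (a confidence threshold that adapts during denoising while staying within bounds), so the exact update rule is not given here. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes a hypothetical rule in which the threshold is the running mean confidence of already-decoded tokens, clamped to `[lower, upper]`. The function names, the default bounds, the fixed per-position confidences, and the "always unmask at least one token" fallback are all illustrative choices.

```python
from typing import Dict, List, Sequence


def bounded_adaptive_threshold(decoded_confidences: Sequence[float],
                               lower: float = 0.6,
                               upper: float = 0.95) -> float:
    """Hypothetical rule: running mean confidence, clamped to [lower, upper]."""
    if not decoded_confidences:
        return upper  # no evidence yet, so start at the strict end of the range
    mean = sum(decoded_confidences) / len(decoded_confidences)
    return max(lower, min(upper, mean))


def select_positions(masked_confidences: Dict[int, float],
                     threshold: float) -> List[int]:
    """Pick every masked position whose confidence reaches the threshold."""
    chosen = sorted(p for p, c in masked_confidences.items() if c >= threshold)
    if not chosen:
        # Guarantee progress: fall back to the single most confident position.
        chosen = [max(masked_confidences, key=masked_confidences.get)]
    return chosen


def decode_block(position_confidence: Sequence[float],
                 lower: float = 0.6,
                 upper: float = 0.95) -> List[List[int]]:
    """Toy denoising loop over one block.

    Each position keeps a fixed confidence here; a real model would
    re-score the remaining masked positions at every step.
    Returns the list of positions unmasked at each step.
    """
    masked = dict(enumerate(position_confidence))
    decoded_conf: List[float] = []
    schedule: List[List[int]] = []
    while masked:
        tau = bounded_adaptive_threshold(decoded_conf, lower, upper)
        chosen = select_positions(masked, tau)
        for p in chosen:
            decoded_conf.append(masked.pop(p))
        schedule.append(chosen)
    return schedule


if __name__ == "__main__":
    # Three confident positions are unmasked in parallel; the hard one follows.
    print(decode_block([0.9, 0.92, 0.88, 0.4], lower=0.6, upper=0.85))
```

In this toy setup, the lower bound keeps the threshold from collapsing after a run of low-confidence tokens, and the upper bound keeps it from becoming so strict that decoding degenerates to one token per step. Whether BACD uses these particular mechanisms is not stated in the abstract.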

Paper Structure (45 sections, 11 equations, 10 figures, 7 tables, 2 algorithms)

Introduction
Preliminary
Block Diffusion Language Models
Sampling Algorithm for BDLMs
Static Confidence Decoding (nie2025largellada)
Dynamic Confidence Decoding (wu2025fastdllmtrainingfreeaccelerationdiffusion)
Method
Bounded Adaptive Confidence Decoding
Think Coarse, Critic Fine
Coarse Thinking.
Fine Critic.
Training with Large Block Sizes.
Experiment
Experiment Setup
Adaptation to Long CoT Reasoning BDLMs.
...and 30 more sections

Figures (10)

Figure 1: Performance and speed comparison of BDLMs. Our TDAR-8B-Thinking achieves $1.71\times$ speedup with BACD and +11.7% accuracy with TCCF compared to the best baselines.
Figure 2: Overview of our reasoning process. We use Bounded Adaptive Confidence Decoding to enable fast exploration with large block sizes, and apply small block sizes for fine-grained refinement.
Figure 3: Accuracy and Speed under different thresholds on AIME24 and Math500. Gold marker indicates our selected checkpoint.
Figure 4: Average token confidence under different confidence thresholds.
Figure 5: Error type analysis under different confidence thresholds for BACD and Dynamic Confidence.
...and 5 more figures
