dVoting: Fast Voting for dLLMs

Sicheng Feng, Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang

TL;DR

This paper tackles the high inference cost of test-time scaling in diffusion large language models (dLLMs). It introduces dVoting, a training-free voting strategy that uses remask sampling and token-consistency analysis to iteratively refine uncertain tokens and aggregate multiple candidate generations via voting. The authors provide empirical evidence of a key observation: repeated tokens across samples indicate redundancy, quantified by the Non-Unique Position Rate ($NUPR@k$), and show that focusing sampling on uncertain positions yields consistent performance gains across GSM8K, MATH500, ARC-C, and MMLU with modest overhead. The approach achieves a favorable performance–efficiency trade-off, outperforming baselines like HEX and RFG, and generalizes to RL-enhanced models, offering a practical baseline for efficient test-time scaling in dLLMs and enabling broader deployment under limited computational budgets.

Abstract

Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, which gives them significant potential for parallel test-time scaling, an approach previously constrained by the severe inefficiency of autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, at an acceptable extra computational cost. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at https://github.com/fscdc/dVoting
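
The loop described in the abstract (sample, find uncertain positions by consistency analysis, remask and regenerate them, vote, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation; see the repository above for that. Every name here (`MASK`, `consistent_positions`, `majority_vote`, `refine_by_voting`, `sample_fn`, `extract_answer`) is hypothetical. The sketch also assumes that a position counts as consistent only when all samples agree on it, and that refinement stops once the answers of the current round reach a chosen agreement share. The paper's actual criteria may differ.

```python
from collections import Counter

MASK = None  # hypothetical marker for a position that must be (re)generated


def consistent_positions(samples):
    """Keep a token where every sample agrees; mark all other positions MASK.

    Assumption: "consistent" means unanimous agreement across samples.
    """
    return [col[0] if len(set(col)) == 1 else MASK for col in zip(*samples)]


def majority_vote(answers):
    """Return the most frequent answer and the share of votes it received."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


def refine_by_voting(sample_fn, extract_answer, length,
                     k=3, max_rounds=4, agree=1.0):
    """Sketch of iterative remask-and-vote refinement.

    sample_fn(template) -> list of tokens; it must fill MASK positions and
        leave already fixed tokens untouched (stand-in for a dLLM sampler).
    extract_answer(tokens) -> hashable final answer parsed from a sample.
    """
    template = [MASK] * length      # round 1: everything is generated
    candidates = []                 # answers collected over all rounds
    for _ in range(max_rounds):
        samples = [sample_fn(template) for _ in range(k)]
        answers = [extract_answer(s) for s in samples]
        candidates.extend(answers)
        # Early stop (assumed criterion): this round's answers agree enough.
        _, share = majority_vote(answers)
        if share >= agree:
            break
        # Preserve tokens all samples agree on; remask the uncertain rest.
        template = consistent_positions(samples)
    # Final prediction: majority vote over every candidate seen so far.
    return majority_vote(candidates)[0]
```

In practice `sample_fn` would wrap a dLLM that fills only the masked positions of `template`, so each later round spends compute on the uncertain tokens rather than regenerating the whole sequence.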

Paper Structure (17 sections, 2 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Overview of dVoting. For each prompt, dVoting preserves the tokens that are consistent across previous generations, remasks the remaining tokens to initiate the next round of sampling, and terminates early once the candidate answers satisfy the consistency criteria.
  • Figure 2: Empirical study on LLaDA-8B-Instruct. We report the performance of five strategies: (1) Pass@1; (2) Pass@3; (3) Pass@5; (4) majority voting ($5$ samples), which represents standard test-time scaling strategies; and (5) d1, which represents RL-enhanced models. Results on GSM8K, MATH500, and ARC-C are shown in (a), (b), and (c), respectively.
  • Figure 3: (a) Distribution of voting consistency levels across sample categories, where categories are defined by the correctness of the baseline and voting predictions. (b) Two cases illustrating token-level redundancy in dLLM sampling ($5$ samples), drawn from GSM8K and ARC-C, respectively (zoom in for details).
  • Figure 4: Comparison of the performance-efficiency trade-off between dVoting and other test-time scaling baselines. Results on the LLaDA model for GSM8K and MATH500 are shown in (a) and (b), respectively. Results on the Dream model for GSM8K, MATH500, and ARC-C with a generation length of $128$ are shown in (c). Our method is marked with a star to distinguish it from the other methods.
  • Figure 5: Case visualization. We present two cases in which the original model yields a correct and an incorrect answer, respectively. For Prompt-2, the original model fails by incorrectly assuming a five-day week, whereas our method iteratively refines this part to produce the correct answer.
  • ...and 1 more figure