Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts

Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi

TL;DR

This work reveals that diffusion-based LLMs implicitly learn a mixture of semi-autoregressive experts, with different generation orders exposing distinct specializations. Because a single test-time decoding schedule can underutilize this latent ensemble, the authors propose HEX, a training-free method that ensembles over diverse semi-autoregressive block schedules via majority voting to achieve robust test-time scaling. Across GSM8K, MATH, ARC-C, and TruthfulQA, HEX attains state-of-the-art or near-state-of-the-art results without retraining and even surpasses some fine-tuned baselines, demonstrating a practical, compute-tunable approach to inference. The findings establish a new paradigm for inference in diffusion LLMs, emphasizing the importance of decoding order and schedule diversity in unlocking the model’s latent reasoning capabilities.

Abstract

Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semiautoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over generation paths with diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56× (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning in ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in dLLMs, revealing that the sequence in which masking is performed plays a critical role in determining performance during inference.
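The voting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `generate` is a hypothetical stand-in for one semi-autoregressive decoding pass at a given block size followed by answer extraction, `toy_generate` is a hard-coded stub rather than a real model, and the tie-breaking rule (earliest schedule wins) is an assumption, since the text above does not specify one.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def hex_majority_vote(
    prompt: str,
    block_sizes: Sequence[int],
    generate: Callable[[str, int], Optional[str]],
) -> Optional[str]:
    """Decode once per block size and return the most frequent answer.

    `generate` is a hypothetical callable (an assumption for illustration):
    it runs one decoding pass with the given block size and returns the
    extracted final answer, or None if no answer could be parsed.
    """
    answers = [generate(prompt, b) for b in block_sizes]
    # Ignore runs that produced no usable answer (e.g. a collapsed output).
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    # Assumed tie-break: among equally frequent answers, the one seen
    # first (earliest block size in the list) is returned.
    return votes.most_common(1)[0][0]


def toy_generate(prompt: str, block_size: int) -> Optional[str]:
    # Toy stub, not a real model: some schedules are wrong or collapse.
    outcomes = {8: "41", 16: "42", 32: "42", 64: None, 128: "42"}
    return outcomes.get(block_size)


winner = hex_majority_vote("What is 6 * 7?", [8, 16, 32, 64, 128], toy_generate)
print(winner)  # prints 42
```

In this toy run, three of the five schedules agree on "42", one disagrees, and one yields nothing, so the vote recovers "42" even though a single fixed schedule (block size 8 or 64) would have failed. Adding more block sizes to the list is how compute would be traded for robustness in such a setup.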

Paper Structure

This paper contains 21 sections, 6 equations, 14 figures, 8 tables, 2 algorithms.

Figures (14)

  • Figure 1: Overview of our proposed HEX framework. Left: HEX leverages multiple semi-autoregressive hidden experts, guided by different masking schedules, to produce concatenated outputs and a final answer. Right: HEX outperforms the Top-K, Top-K margin, and Random expert selection strategies on reasoning tasks (GSM8K, MATH, ARC-C), surpassing the training-based GRPO baseline (d1).
  • Figure 2: Random vs. Top-K margin inference on GSM8K. Left: Random decoding achieves 50.87% accuracy. Right: Top-K margin achieves only 24.72%. For each method, the text box shows the result at the last unmasking step. Top-K margin generates output tokens in reverse, from the end toward the beginning, and exhibits a catastrophic collapse in which all tokens are [AfterEoT] (shown in red). Over 55.5% of Top-K margin runs suffered this collapse, yielding very low accuracy. These failures cast doubt on methods that rely solely on token confidence.
  • Figure 3: The distribution of the 4th token 'Bell' in the output sequence changes significantly depending on which of the $2^3$ masking conditions is applied to the previous three tokens: 'The', 'inventor', 'was'. The star mark indicates the highest confidence for each distribution generated by $U$. Some masking conditions (violet and green) produce collapsed distributions: "Bell Bell was invented." (an ungrammatical sentence) and "The telephone was invented." (missing the target information), respectively.
  • Figure 4: When asked about the 2024 Turing Award winners, the model might generate names other than the actual recipients (such as Michael or David) under some block sizes, which in turn risks producing incorrect information in the subsequent token sequence. However, if we generate outputs with various block sizes and then select the most frequently produced answer, that answer is more likely to be correct, since it was probably derived through a valid reasoning path (Andrew) during inference.
  • Figure 5: HEX improves reasoning accuracy. On LLaDA-8B-Instruct, HEX outperforms training-free baselines (Random, Top-$k$, Top-$k$-margin) on GSM8K, MATH, ARC-C, and TruthfulQA. On GSM8K, MATH, and ARC-C, it even outperforms the GRPO-trained model, despite requiring no training itself.
  • ...and 9 more figures