RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance

Tianlang Chen, Minkai Xu, Jure Leskovec, Stefano Ermon

TL;DR

This work targets stronger reasoning in diffusion large language models (dLLMs) without training an additional reward model. It introduces Reward-Free Guidance (RFG), which defines a trajectory-level reward as the log-likelihood ratio between a policy dLLM and a reference dLLM and derives per-step process rewards from it to guide the denoising trajectory. The authors prove that this stepwise guidance yields a reward-guided sampling distribution and report training-free improvements on four mathematical reasoning and code generation benchmarks, with accuracy gains of up to 9.2%, across diverse model families and post-training methods. Because it needs no extra reward data or training, the approach offers a flexible framework for future alignment and reasoning enhancements in diffusion-based systems.

Abstract

Diffusion large language models (dLLMs) have shown great potential in large-scale language modeling, and there is an increasing interest in further improving the capacity to solve complex problems by guiding the reasoning process step by step. Common practice for autoregressive language models typically learns a process reward model with dense annotation for each intermediate step. However, this is challenging for dLLMs where the generation is in an any-order fashion and intermediate states are partially masked sentences. To this end, in this paper, we propose reward-free guidance (RFG), a principled method for guiding the reasoning trajectory of dLLMs without explicit process reward. The key idea of RFG is to parameterize the process reward by log-likelihood ratios of the enhanced and reference dLLMs, where the enhanced model can be easily obtained by any off-the-shelf dLLM that has been post-trained with reinforcement learning (RL) or supervised fine-tuning (SFT). We provide theoretical justification that RFG induces the reward-guided sampling distribution with no additional reward. We conduct comprehensive experiments on four challenging mathematical reasoning and code generation benchmarks using a diverse suite of dLLMs enhanced with various post-training methods. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%. These findings establish RFG as a general training-free framework that scales test-time reasoning without reliance on external reward models.
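
As a rough illustration of guiding a denoising step with a log-likelihood ratio, the sketch below combines the per-token predictions of an enhanced (policy) model and a reference model. The combination rule $\log p_\text{guided} \propto (1+w)\log p_\theta - w\log p_\text{ref}$ is an assumption here, chosen to mirror classifier-free-guidance-style weighting; the function names, the toy logits, and the use of NumPy are hypothetical and not taken from the paper. The symbol $w$ is used in the sense of a guidance strength.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def guided_log_probs(policy_logits, ref_logits, w):
    """Assumed guidance rule (not confirmed by the text above):
    log p_guided ∝ (1 + w) * log p_policy - w * log p_ref,
    renormalized over the vocabulary at each position.
    """
    lp = log_softmax(policy_logits)
    lr = log_softmax(ref_logits)
    return log_softmax((1.0 + w) * lp - w * lr)


# Toy example: one masked position, a vocabulary of three tokens.
policy_logits = np.array([[2.0, 1.0, 0.0]])  # hypothetical enhanced-model logits
ref_logits = np.array([[1.0, 1.0, 1.0]])     # hypothetical reference-model logits

for w in (0.0, 1.0):
    probs = np.exp(guided_log_probs(policy_logits, ref_logits, w))
    print(f"w={w}: {probs.round(3)}")
```

With $w = 0$ the rule reduces to the policy model's own distribution; larger $w$ shifts probability toward tokens that the policy model prefers more strongly than the reference does.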

Paper Structure

This paper contains 45 sections, 2 theorems, 70 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Proposition 3.1

Given a diffusion trajectory-level reward that is parameterized as the log-likelihood ratio of two dLLMs, i.e., $r_\theta(\mathbf{x}_{0:T}) := \beta \log \frac{p_\theta(\mathbf{x}_{0:T})}{p_\text{ref}(\mathbf{x}_{0:T})}$, where $\beta$ is a hyperparameter for weighting the reward function. Define $Q_\theta^t(\mathbf{x}_{t-1}, \mathbf{x}_{t:T}) := \sum_{i=t}^{\cdots} \cdots$ …
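
To see how a trajectory-level log-likelihood ratio can be split into per-step terms, here is a short worked derivation. It rests on two assumptions that are not stated in the text above: both dLLMs factorize as Markov chains over denoising steps, $p(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\prod_{t=1}^{T} p(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$, and they share the same initial distribution $p(\mathbf{x}_T)$ (for example, a fully masked sequence). It is a sketch of the general idea, not a reconstruction of the truncated definition of $Q_\theta^t$.

```latex
\begin{aligned}
r_\theta(\mathbf{x}_{0:T})
&= \beta \log \frac{p_\theta(\mathbf{x}_T)\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}
                   {p_\text{ref}(\mathbf{x}_T)\prod_{t=1}^{T} p_\text{ref}(\mathbf{x}_{t-1}\mid \mathbf{x}_t)} \\
&= \sum_{t=1}^{T} \beta \log \frac{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}
                                  {p_\text{ref}(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}
\qquad \text{since } p_\theta(\mathbf{x}_T) = p_\text{ref}(\mathbf{x}_T).
\end{aligned}
```

Under these assumptions each summand depends only on one denoising transition, so it can be evaluated at that step from the two models' outputs without any separately trained reward model.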

Figures (3)

  • Figure 1: RFG consistently achieves significant improvements across all four tasks and various model types with different post-training methods.
  • Figure 2: Sampling illustration for original policy model and RFG.
  • Figure 3: Accuracy of RFG under varying guidance strength $w$ across four benchmarks: GSM8K and MATH-500 for mathematical reasoning using d1-LLaDA, and HumanEval and MBPP for code generation using Dream-Instruct and DiffuCoder, respectively. We observe that RFG consistently improves performance over a broad range of guidance strength.

Theorems & Definitions (3)

  • Proposition 3.1
  • Proposition \ref{prop:prm}
  • Proof