wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

Xiaohang Tang, Rares Dolga, Sangwoong Yoon, Ilija Bogunovic

TL;DR

Problem: RL fine-tuning of diffusion-based LLMs is hindered by intractable likelihoods, forcing expensive approximations that introduce bias. Approach: wd1 replaces policy ratios with a weighted log-likelihood objective derived from reverse-KL regularization, using group-relative advantage to weight completions and a complementary negative term to penalize low-advantage samples. Contributions: formalizes the method, proves monotonic improvement, and shows that wd1 matches or exceeds existing RL methods without supervised fine-tuning. Impact: wd1 delivers up to 16% accuracy gains on reasoning benchmarks and reduces training cost and NFEs, making RL for dLLMs more scalable and practical.

Abstract

Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and can lead to large bias, particularly when approximation errors occur in the denominator of the policy ratios used for importance sampling. To mitigate these issues, we introduce $\mathtt{wd1}$, a novel policy optimization approach that reformulates the objective as a weighted likelihood, requiring only a single approximation, for the current parametrized policy likelihood. Experiments on widely used reasoning benchmarks demonstrate that $\mathtt{wd1}$, without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs, achieving up to 16% higher accuracy. $\mathtt{wd1}$ also delivers computational gains, including reduced training time and fewer function evaluations (NFEs) per gradient step. These findings, combined with the simplicity of the method's implementation and its R1-Zero-like training (no SFT), position $\mathtt{wd1}$ as a more effective and efficient method for applying RL to dLLM reasoning.
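To make the weighted-likelihood idea concrete, the following is a minimal sketch of a loss over one group of completions sampled for the same prompt. It is an illustration under stated assumptions, not the paper's implementation: the softmax form of the weights, the `temperature` parameter, and all function names are hypothetical. What it does reflect from the summary above is that completions are weighted by group-relative advantage, that a complementary negative term penalizes low-advantage samples, and that only the current policy's (approximate) log-likelihoods are needed.

```python
import math

def group_relative_advantages(rewards):
    """Center each reward on the mean of its group (one group = one prompt)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_likelihood_loss(logps, rewards, temperature=1.0):
    """Hypothetical weighted log-likelihood loss for one group.

    logps:   approximate log-likelihoods of each completion under the
             CURRENT policy (no old or reference policy is evaluated).
    rewards: scalar reward of each completion.
    """
    adv = group_relative_advantages(rewards)
    # Assumed weighting: positive weights favor high-advantage completions,
    # negative weights single out low-advantage completions.
    w_pos = softmax([a / temperature for a in adv])
    w_neg = softmax([-a / temperature for a in adv])
    # Minimizing this raises the likelihood of high-advantage samples
    # and lowers the likelihood of low-advantage ones.
    return -sum((wp - wn) * lp for wp, wn, lp in zip(w_pos, w_neg, logps))

loss = weighted_likelihood_loss(logps=[-2.0, -3.0], rewards=[1.0, 0.0])
```

In this sketch the log-likelihood estimates enter the loss linearly, so no policy ratio, and hence no approximated denominator, appears anywhere in the computation.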

Paper Structure

This paper contains 22 sections, 3 theorems, 18 equations, 4 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Let the surrogate objective be $L_{\pi_{\text{old}}}(\pi)= \eta(\pi_{\text{old}}) + \mathbb{E}_{s \sim \rho_{\pi_{\text{old}}}(\cdot),\ a \sim \pi(\cdot \mid s)} [ A^{\pi_{\text{old}}}(s, a) ]$, and let $C=\frac{4 \gamma \max_{s, a, \pi} |A^{\pi}(s, a)|}{(1-\gamma)^2}$. Then, $\forall k \in \mathbb{N}$:
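For a sense of scale, the constant $C$ defined in the statement above can be evaluated numerically. The snippet below is a small sketch that only computes $C$ from its definition; the function name and the argument `max_abs_advantage` (standing in for $\max_{s,a,\pi} |A^{\pi}(s,a)|$) are hypothetical, and the value of that maximum is assumed to be known.

```python
def improvement_penalty_coefficient(max_abs_advantage, gamma):
    """Return C = 4 * max|A| * gamma / (1 - gamma)^2 for a discount 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 4.0 * max_abs_advantage * gamma / (1.0 - gamma) ** 2

# Evaluate C for a few discount factors with max|A| assumed to be 1.
for gamma in (0.5, 0.9, 0.99):
    print(gamma, improvement_penalty_coefficient(1.0, gamma))
```

Because of the $(1-\gamma)^2$ denominator, $C$ grows rapidly as $\gamma$ approaches 1.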

Figures (4)

  • Figure 1: Performance on popular reasoning and planning benchmarks with the same base model, LLaDA. All models are evaluated with maximum lengths of $256$ and $512$, and the best result is reported. Our method, wd1, outperforms both our reproduction of d1 and the accuracies of LLaDA [zhao2025d1].
  • Figure 2: Training reward dynamics of wd1 and d1. Standard deviation is reported over a rolling window of 50 steps. The average reward of the samples generated by wd1 increases faster than that of the baseline d1.
  • Figure 3: Completion-length dynamics of wd1 and d1. In math problem-solving tasks (GSM8K and MATH500), our method produces shorter completions and better token efficiency.
  • Figure 4: Reward Dynamics. wd1 without SFT demonstrates better rewards in Sudoku and Countdown.

Theorems & Definitions (6)

  • Theorem 1: Policy Improvement Bound [kakade2002approximately, schulman2015trust]
  • Remark 1
  • Theorem 2
  • proof
  • Theorem 3
  • proof