d2: Improved Techniques for Training Reasoning Diffusion Language Models

Guanghan Wang, Yair Schiff, Gilad Turok, Volodymyr Kuleshov

TL;DR

This paper introduces d2, a principled RL framework for diffusion language models (DLMs) built on trajectory-likelihood-based policy gradients. It presents two specialized estimators, d2-StepMerge and d2-AnyOrder, that efficiently estimate trajectory likelihood under masked diffusion, and it analyzes any-order causality as a crucial property for diffusion-based reasoning. Empirically, d2 achieves state-of-the-art reasoning performance among DLMs on Sudoku, Countdown, GSM8K, and MATH500 without supervised chain-of-thought fine-tuning, and demonstrates strong toxicity steering capabilities in a red-teaming setup. The work makes RL-only post-training for DLMs more practical and provides theoretical guarantees on estimator accuracy and applicability. Overall, d2 offers a scalable, theoretically grounded path to stronger reasoning in diffusion-based language models while clarifying the role of any-order decoding.

Abstract

While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on properties of masking to accurately estimate the likelihoods of sampling trajectories. Our estimators trade off computation for approximation accuracy in an analytically tractable manner, and are particularly effective for DLMs that support any-order likelihood estimation. We characterize and study this property in popular DLMs and show that it is key for efficient diffusion-based reasoning. Empirically, d2 significantly improves over previous diffusion reasoning frameworks using only RL (without relying on supervised fine-tuning), and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500).

Paper Structure

This paper contains 33 sections, 10 theorems, 56 equations, 11 figures, 2 tables, 2 algorithms.

Key Result

Theorem 3.1

At $\theta = \theta_{\text{old}}$, $\nabla_\theta\mathcal{J}(\theta)$ admits the following decomposition over latent diffusion steps:

Figures (11)

  • Figure 1: Benchmark performance of different RL post-training algorithms applied to LLaDA-8B-Instruct (Nie et al., 2025). Even without supervised fine-tuning (SFT), d2 outperforms d1 (Zhao et al., 2025) with SFT and wd1 (Tang et al., 2025) on all four reasoning benchmarks.
  • Figure 2: Illustration of our proposed StepMerge strategy. In d2-StepMerge, we cut the trajectory evenly into $N$ time segments and evaluate the likelihood for each segment together. Newly decoded tokens on which we compute the likelihood at the corresponding model forward pass are highlighted.
  • Figure 3: $D_N(\pi_{\textnormal{LLaDA}})$ for varying $N$.
  • Figure 4: Illustration of different DLM decoding strategies. We depict attention with query tokens (one layer up) attending to keys/values (one layer below) via an undirected connected line. The output at each position is depicted with a directed arrow. "pos" refers to positional encoding index. We use a three token example where the decoding order is "for$\rightarrow$d2$\rightarrow$RL". At each time step, newly added attention relations in any-order decoding are highlighted with red line markers.
  • Figure 5: Illustration of one-shot trajectory likelihood evaluation. Continuation of the example from Figure 4.
  • ...and 6 more figures
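
The segmentation described in the Figure 2 caption can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: it only shows one plausible way to cut a decoding trajectory evenly into $N$ time segments and to list, for each segment, which tokens are already visible and which newly decoded tokens would be scored in that segment's forward pass. The representation of a trajectory as a list `decode_step` (the step at which each position was unmasked), the helper names `stepmerge_segments` and `segment_passes`, and the integer-division rule for assigning steps to segments are all assumptions made for illustration; the model forward pass itself is omitted.

```python
from typing import List, Tuple


def stepmerge_segments(
    decode_step: List[int], num_steps: int, num_segments: int
) -> List[List[int]]:
    """Group token positions by the time segment in which they were decoded.

    decode_step[i] is the (0-indexed) sampling step at which position i was
    unmasked. Steps 0..num_steps-1 are split into num_segments contiguous,
    near-equal segments (hypothetical rule: step * N // T).
    """
    if not 1 <= num_segments <= num_steps:
        raise ValueError("need 1 <= num_segments <= num_steps")
    segments: List[List[int]] = [[] for _ in range(num_segments)]
    for pos, step in enumerate(decode_step):
        if not 0 <= step < num_steps:
            raise ValueError(f"step {step} at position {pos} is out of range")
        segments[step * num_segments // num_steps].append(pos)
    return segments


def segment_passes(segments: List[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """For each segment, return (visible positions, target positions).

    Visible positions were decoded in earlier segments; target positions are
    the newly decoded tokens whose likelihood would be evaluated together in
    this segment's single forward pass.
    """
    passes: List[Tuple[List[int], List[int]]] = []
    visible: List[int] = []
    for targets in segments:
        passes.append((sorted(visible), list(targets)))
        visible.extend(targets)
    return passes


if __name__ == "__main__":
    # Six tokens decoded one per step over T = 6 steps, merged into N = 3 segments.
    example = stepmerge_segments([2, 0, 5, 1, 3, 4], num_steps=6, num_segments=3)
    for visible, targets in segment_passes(example):
        print("visible:", visible, "score:", targets)
```

Under these assumptions, $N$ equal to the number of sampling steps scores each step separately, while smaller $N$ uses fewer forward passes at the cost of treating tokens within a segment as decoded together.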

Theorems & Definitions (25)

  • Theorem 3.1
  • Remark 3.2
  • Corollary 3.3
  • Remark 3.4
  • Theorem 4.1: Approximation Error Bound
  • Definition 4.2
  • Theorem 4.3
  • proof
  • Definition A.1: Timing and Value Decomposition
  • Lemma A.2: Timing and Value Decomposition of KL Divergence
  • ...and 15 more