d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang, Minghua He, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen

TL;DR

The paper tackles unreliable reinforcement learning for diffusion language models by introducing d-TreeRPO, which uses tree-structured rollouts to produce fine-grained, verifiable step-wise rewards and estimates each parent-to-child transition probability with a single forward pass rather than integrating over all possible decoding orders. It provides a theoretical bound showing that the resulting estimation error shrinks as model confidence increases, and it mitigates the exploration-exploitation tension with a time-scheduled self-distillation loss that sharpens model determinism in later training stages. Empirically, d-TreeRPO achieves substantial gains over multiple baselines on Sudoku, Countdown, GSM8K, and Math500, with ablations validating both the self-distillation loss and its scheduling strategy. The work demonstrates that reliable policy optimization for diffusion LLMs is both practical and beneficial on reasoning benchmarks.

Abstract

Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that d-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
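
The bottom-up advantage computation mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes, purely for illustration, that a leaf's value is its verifiable outcome reward, that an internal node's value is the mean of its children's values, and that a child's advantage is its value minus its parent's value (a sibling-mean baseline). The names `Node` and `backup` are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Node:
    """One node of a rollout tree (hypothetical structure, for illustration)."""
    reward: Optional[float] = None          # verifiable outcome reward, leaves only
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0                      # filled in by backup()
    advantage: float = 0.0                  # filled in by backup() on the parent


def backup(node: Node) -> float:
    """Propagate leaf rewards bottom-up and assign step-wise advantages.

    Assumption: node value = mean of child values; child advantage =
    child value - parent value. The paper's exact rule may differ.
    """
    if not node.children:
        node.value = float(node.reward)
        return node.value
    node.value = mean(backup(child) for child in node.children)
    for child in node.children:
        child.advantage = child.value - node.value
    return node.value


# Toy tree: two branches, each ending in two verified completions (1 = correct).
branch_a = Node(children=[Node(reward=1.0), Node(reward=1.0)])
branch_b = Node(children=[Node(reward=1.0), Node(reward=0.0)])
root = Node(children=[branch_a, branch_b])
backup(root)
```

In this toy tree, the branch whose completions are all correct receives a positive advantage and the mixed branch a negative one, so an intermediate denoising step is credited according to the verified outcomes beneath it rather than by a single sequence-level reward.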

Paper Structure

This paper contains 17 sections, 30 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Performance comparison of d-TreeRPO with existing dLLM RL methods on four reasoning benchmarks. All methods are evaluated with a generation length of 256 in 128 denoising steps. All tasks adopt zero-shot evaluation with pass@1 scoring.
  • Figure 2: Overview of our proposed d-TreeRPO. By analyzing the RL objective, we identify two requirements for reliable policy optimization: granular/verifiable advantage signals and precise log-probability estimation. Our framework employs a tree-structured rollout that propagates verifiable outcome rewards bottom-up through the denoising hierarchy to establish verifiable step-wise advantages, coupled with single-time forward pass estimation of parent-to-child transition log-probabilities. Guided by theoretical analysis of prediction confidence errors, we further introduce a time-scheduled self-distillation mechanism that progressively sharpens model determinism in later training stages, ensuring more precise estimation and better convergence.
  • Figure 3: Training rewards on Sudoku task under different parameters $H$ and $B$.
  • Figure 4: Training curves for the Sudoku task.
  • Figure 5: Training dynamics comparison: d-TreeRPO vs. its reverse-scheduled variant.