Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, Chongxuan Li

TL;DR

This work tackles the mismatch between reinforcement learning objectives and diffusion large language models by proposing ESPO, a principled sequence-level RL framework that treats an entire sequence as a single action and uses the ELBO as a tractable proxy for sequence likelihood. ESPO couples this with per-token normalization of the importance ratio and a robust KL-divergence estimator, enabling stable large-scale training. Empirical results across mathematics, coding, and planning tasks show ESPO outperforms token-level baselines, with the largest gains (20-40 points on Countdown) on planning tasks that require global consistency. The paper argues that sequence-level optimization is a principled and effective paradigm for RL in diffusion language models and provides open-source code.

Abstract

Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
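
The abstract names two ingredients, a sequence-level action with the ELBO as likelihood proxy and per-token normalization of the importance ratio, without giving formulas. The following is a minimal sketch of how they could fit together, not the authors' implementation. It assumes the ratio is formed by exponentiating the ELBO difference divided by the response length, and that it feeds a PPO/GRPO-style clipped objective; the function names and the clipping range `eps=0.2` are hypothetical.

```python
import math


def sequence_ratio(elbo_new, elbo_old, seq_len):
    """Sequence-level importance ratio with per-token normalization.

    elbo_new / elbo_old: ELBO estimates (in nats) standing in for
    log pi(y | x) of the whole response y under the current policy and
    the policy that generated y. ASSUMPTION: this exact form is
    illustrative and is not taken from the text above.
    """
    # Dividing the log-ratio by the length keeps exp() in a usable range.
    return math.exp((elbo_new - elbo_old) / seq_len)


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective, applied once per sequence.

    eps is a hypothetical clipping range chosen for illustration.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

For a 200-token response whose ELBO rises from -120 to -100 nats, the normalized ratio is exp(0.1), roughly 1.105, whereas the unnormalized ratio exp(20) is about 4.85e8. This illustrates why some form of length normalization is needed before any clipping range becomes meaningful at the sequence level.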

Paper Structure

This paper contains 54 sections, 30 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Training performance on the Sudoku task under different action spaces (Token-level vs. Sequence-level) and likelihood approximations (Mean-field vs. ELBO). Our method (blue) combines a sequence-level action space with an ELBO approximation, yielding the most stable and highest performance.
  • Figure 2: Training performance on the Sudoku task with different KL-divergence estimators. The $k_2$ estimator (blue) achieves stable and superior performance. The $k_1$ estimator (orange) is highly unstable and collapses, while the $k_3$ estimator (green) stagnates.
  • Figure 3: Training dynamics of different methods on the Countdown and Sudoku tasks with LLaDA-8B-Instruct.
  • Figure 4: Ablation study on the number of Monte Carlo samples for Countdown and Sudoku. We evaluate training performance with different MC sample counts (1, 2, 4), showing the effect of increased sampling on reward optimization.
  • Figure 5: Ablation study on the policy update values ($\mu$) for Countdown and Sudoku. The reward curves illustrate performance across a range of $\mu$ values. While smaller values (e.g., 8, 12) lead to faster initial convergence on Sudoku, the method is robust and achieves similarly high rewards across all settings for Countdown and Sudoku tasks.
  • ...and 3 more figures
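
The Figure 2 caption refers to the $k_1$, $k_2$, and $k_3$ estimators without defining them. Assuming the names follow the common convention for single-sample estimators of $\mathrm{KL}(q \,\|\, p)$ from a sample $x \sim q$ with $r = p(x)/q(x)$, they can be sketched as below. This is an illustrative sketch, not code from the paper; in ESPO's setting the two log-probabilities would presumably be ELBO estimates, which is also an assumption here.

```python
import math


def kl_estimators(logp, logq):
    """Single-sample estimators of KL(q || p) for one sample x ~ q.

    logp, logq: log-probabilities (or proxies such as ELBO estimates,
    which is an ASSUMPTION here) of the same sample under p and q.
    Returns the tuple (k1, k2, k3).
    """
    log_r = logp - logq              # log of r = p(x) / q(x)
    k1 = -log_r                      # unbiased, but can be negative
    k2 = 0.5 * log_r ** 2            # always non-negative, biased
    k3 = math.expm1(log_r) - log_r   # (r - 1) - log r, non-negative
    return k1, k2, k3
```

Note that $k_3$ contains the term $r = \exp(\log r)$, which grows exponentially in the log-ratio, while $k_2$ grows only quadratically. For example, a log-ratio of 1 gives $k_1 = -1$, $k_2 = 0.5$, and $k_3 = e - 2 \approx 0.718$.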