SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu

TL;DR

This paper tackles a core difficulty in aligning diffusion-based language models with rewards: their log-likelihoods are intractable. It introduces Sandwiched Policy Gradient (SPG), which maximizes a lower bound (the ELBO) on the likelihood of positive-reward sequences while minimizing an upper bound (a tractable EUBO derived from Rényi bounds) on the likelihood of negative-reward sequences. SPG is augmented by a block-wise masking strategy and a mixture of bounds that reduces gradient variance. The authors justify the mixture’s variance reduction theoretically and show in experiments on GSM8K, MATH500, Countdown, and Sudoku that SPG outperforms ELBO-based RL baselines and achieves state-of-the-art results among RL methods for diffusion LMs. Ablations on components, hyperparameters, and inference strategies indicate robust performance across settings. Overall, SPG offers a principled, bias-reduced framework for RL in diffusion-based language models, with improvements on multiple reasoning benchmarks and resilience to the choice of inference strategy.

Abstract

Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
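To make the sandwiching idea concrete, here is a minimal NumPy sketch of how a per-batch surrogate could be assembled from bound estimates. It is an illustration under stated assumptions, not the authors' implementation: the function name `sandwiched_surrogate`, the plain mean over responses, and the way the mixture weight `omega` enters are all hypothetical, and the bound estimates are taken as given numbers rather than computed from a model.

```python
import numpy as np

def sandwiched_surrogate(advantages, lower_bounds, upper_bounds, omega=1.0):
    """Toy sandwiched objective (a value to be maximized).

    advantages   : shape (g,), one advantage per sampled response
    lower_bounds : shape (g,), lower-bound estimates of log-likelihood (e.g. ELBO)
    upper_bounds : shape (g,), upper-bound estimates of log-likelihood (e.g. EUBO)
    omega        : hypothetical mixture weight in [0, 1] applied to
                   negative-advantage responses (1.0 = pure upper bound)
    """
    advantages = np.asarray(advantages, dtype=float)
    lower = np.asarray(lower_bounds, dtype=float)
    upper = np.asarray(upper_bounds, dtype=float)
    # Negative-advantage responses use a mixture of the two bounds.
    negative_term = omega * upper + (1.0 - omega) * lower
    # Positive-advantage responses use the lower bound.
    per_sample = np.where(advantages > 0, lower, negative_term)
    return float(np.mean(advantages * per_sample))

# Usage: two responses, one rewarded and one penalized.
adv = [1.0, -1.0]
lower = [-3.0, -5.0]   # assumed lower-bound estimates
upper = [-2.0, -4.0]   # assumed upper-bound estimates
value = sandwiched_surrogate(adv, lower, upper, omega=1.0)
```

With `omega=1.0`, each term `A * bound` is no larger than `A * log-likelihood` whenever the true log-likelihood lies between the two bounds, so the surrogate lower-bounds the true advantage-weighted objective in this toy setting.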

Paper Structure

This paper contains 68 sections, 7 theorems, 48 equations, 9 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

Assume the forward denoising process has $T$ steps with a monotonic schedule $\alpha_t$. For any $\beta \geq 1$ and a sequence $\bm{x}_{1:n}$, we have: where $C(T)\vcentcolon= \mathbbm{1}(\beta<n) \cdot \frac{1}{\beta} \log \mathbb{E}_{\bm{z}_{1:T} \sim q(\cdot \mid \bm{x})} [q(\bm{z}_{1:T}\mid\bm{x})^{-n}]$ is a constant independent of $\bm{\theta}$.
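The general mechanism behind a Rényi-style upper bound can be checked numerically on a toy model. The sketch below is not the masked-diffusion bound of Theorem 1; it only illustrates the generic inequality $\mathbb{E}_q[\log w] \le \log \mathbb{E}_q[w] \le \frac{1}{\beta}\log \mathbb{E}_q[w^\beta]$ for $\beta \ge 1$, where $w = p(x, z)/q(z \mid x)$ is an importance weight. The discrete latent variable, the function name `sandwich_bounds`, and the numbers are assumptions chosen for illustration.

```python
import numpy as np

def sandwich_bounds(log_joint, q, beta=2.0):
    """Return (elbo, log_px, renyi_upper) for a discrete latent z.

    log_joint : shape (k,), log p(x, z) for each of the k latent values
    q         : shape (k,), proposal probabilities q(z | x), all > 0, summing to 1
    beta      : order of the Renyi-style bound, beta >= 1
    """
    log_joint = np.asarray(log_joint, dtype=float)
    q = np.asarray(q, dtype=float)
    log_w = log_joint - np.log(q)                      # log importance weights
    elbo = float(np.sum(q * log_w))                    # E_q[log w] <= log p(x)
    log_px = float(np.log(np.sum(np.exp(log_joint))))  # exact log p(x)
    # (1/beta) * log E_q[w^beta] >= log p(x) for beta >= 1
    upper = float(np.log(np.sum(q * np.exp(beta * log_w))) / beta)
    return elbo, log_px, upper

# Usage: a three-state latent with an imperfect proposal.
log_joint = np.log([0.10, 0.20, 0.05])
q = [0.5, 0.3, 0.2]
elbo, log_px, upper = sandwich_bounds(log_joint, q, beta=2.0)
```

At `beta=1.0` the upper bound coincides with the exact log-likelihood in this toy case, and it loosens as `beta` grows.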

Figures (9)

  • Figure 1: Test accuracy of SPG and baseline methods on four mathematical and logical reasoning benchmarks. All methods are evaluated with a generation length of 256 and 128 denoising steps. Full results are provided in the main results table (tab:main_results).
  • Figure 2: The training process of SPG for MDLM. Left: From a prompt $\bm{c}$, we generate responses $\{\bm{x}^{j}\}_{j=1}^g$. We then maximize a lower bound on the likelihood $\pi_{\bm{\theta}}(\bm{x}^j\mid\bm{c})$ for high-reward responses while minimizing an upper bound for low-reward ones. Right: The upper/lower bound of likelihood is estimated via Monte Carlo using a block-wise masking strategy, where a random block is selected for masking, with earlier blocks kept clean and later blocks fully masked. The example shows a sequence of length 9 with a block size of 3, where the current generation block is highlighted in yellow.
  • Figure 3: Reward dynamics of SPG w/ Mixture during RL training, compared with D1, WD1, and UniGRPO. SPG consistently leads to faster convergence and a higher reward level. We report the mean and standard deviation over a rolling window of 50 steps.
  • Figure 4: Reward dynamics of different log-likelihood estimation methods for negative advantage traces on Sudoku. SPG w/ Mixture leads to both fast convergence and high rewards.
  • Figure 5: (a)-(d): ablations on the effect of $\beta$ in the upper bound; (e)-(f): ablations on the mixture coefficient $\omega$. The best-performing $\beta\geq 1$ and $\omega \in[0,1]$ are marked by a triangle in each setting.
  • ...and 4 more figures
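The block-wise masking described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the mask token id `MASK`, the function name `blockwise_mask`, and the rule for masking inside the selected block (each position masked independently with probability `mask_ratio`) are assumptions, since the caption only states that a random block is selected, earlier blocks stay clean, and later blocks are fully masked.

```python
import numpy as np

MASK = -1  # hypothetical mask token id, chosen so it cannot collide with real ids

def blockwise_mask(tokens, block_size, mask_ratio, rng):
    """Mask one randomly chosen block; keep earlier blocks clean, mask later ones.

    tokens     : 1-D sequence of non-negative token ids
    block_size : number of tokens per block
    mask_ratio : probability of masking each position inside the chosen block
                 (an assumption; the caption does not specify this rule)
    rng        : numpy.random.Generator
    Returns (masked_tokens, chosen_block_index).
    """
    tokens = np.asarray(tokens)
    n = len(tokens)
    num_blocks = (n + block_size - 1) // block_size
    block = int(rng.integers(num_blocks))   # randomly selected current block
    start = block * block_size
    end = min(start + block_size, n)
    masked = tokens.copy()                  # earlier blocks stay clean
    in_block = rng.random(end - start) < mask_ratio
    masked[start:end] = np.where(in_block, MASK, masked[start:end])
    masked[end:] = MASK                     # later blocks are fully masked
    return masked, block

# Usage: the caption's setting, a sequence of length 9 with a block size of 3.
rng = np.random.default_rng(0)
masked, block = blockwise_mask(np.arange(10, 19), block_size=3, mask_ratio=0.5, rng=rng)
```

Drawing several such masked copies per response would give the Monte Carlo samples over which the bounds are averaged.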

Theorems & Definitions (9)

  • Theorem 1: Evidence Upper Bound for Masked Diffusion
  • Corollary 1
  • Proposition 1: Optimal Mixture Strictly Reduces Variance
  • Lemma 1: Rényi Variational Bound (Rényi, 1961; van Erven and Harremoës, 2014)
  • Theorem 1: Evidence Upper Bound for Masked Diffusion
  • proof
  • Corollary 1
  • Proposition 1: Optimal Mixture Strictly Reduces Variance
  • proof