Stabilizing Reinforcement Learning for Diffusion Language Models

Jianyuan Zhong, Kaibo Wang, Ding Ding, Zijin Feng, Haoli Bai, Yang Xiang, Jiacheng Sun, Qiang Xu

TL;DR

StableDRL reformulates GRPO for dLLMs: unconditional clipping suppresses outlier-induced gradient spikes, and self-normalization constrains each update to the convex hull of the per-sample gradients. The method extends to block-wise diffusion models via a staircase attention mechanism.

Abstract

Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two sources of incompatibility. First, GRPO relies on importance ratios defined by sequence probabilities, which are intractable in dLLMs and must be estimated (e.g., via ELBO-based or mean-field likelihood proxies), yielding inherently noisy ratios. Second, standard GRPO's formulation is not designed for estimated ratios: its conditional clipping can be anomalously bypassed by model-agnostic estimation noise, producing gradient spikes, while its fixed group-size normalization amplifies gradient-magnitude fluctuations under high-variance ratio estimates. We show these effects form a self-reinforcing instability loop that drives policy drift and further increases ratio variance. To break this loop, we propose StableDRL, a reformulation of GRPO tailored for dLLMs that uses (i) unconditional clipping to suppress outlier-induced spikes and (ii) self-normalization to constrain updates within the convex hull of per-sample gradients. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism.
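The two fixes named in the abstract can be illustrated with a small numerical sketch. It is a simplified reading of the abstract, not the paper's implementation: the function names `grpo_weights` and `stable_weights`, the clip width `eps`, and the choice to treat the clipped ratios as constant weights are all assumptions made here for illustration. The first function computes the per-sample gradient weight implied by the standard `min(r*A, clip(r)*A)` surrogate divided by the group size; the second clips every ratio regardless of the advantage sign and divides by the sum of clipped ratios, so the weights are non-negative and sum to one.

```python
import numpy as np

def grpo_weights(ratio, adv, eps=0.2):
    """Gradient weights implied by min(r*A, clip(r)*A), normalized by group size.

    The min selects the unclipped term whenever r*A <= clip(r)*A, so a large
    noisy ratio paired with a negative advantage is NOT clipped.
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    unclipped_active = ratio * adv <= clipped * adv
    # When the clipped branch is active, the surrogate has zero gradient.
    return np.where(unclipped_active, ratio, 0.0) / ratio.size

def stable_weights(ratio, eps=0.2):
    """Assumed sketch: clip every ratio, then normalize by the sum of weights."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Non-negative weights summing to 1 give a convex combination of
    # the per-sample gradients.
    return clipped / clipped.sum()

# Hypothetical group of four samples; the third ratio is an estimation outlier.
ratio = np.array([1.0, 0.9, 50.0, 1.1])
adv = np.array([1.0, -0.5, -1.0, 0.5])

print(grpo_weights(ratio, adv))
print(stable_weights(ratio))
```

In this toy group the outlier receives a weight of 12.5 under the conditional rule, while under the clipped, self-normalized rule no sample can exceed `(1 + eps) / (G * (1 - eps))` for group size `G`.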

Paper Structure (70 sections, 17 theorems, 96 equations, 8 figures, 4 tables, 1 algorithm)

Key Result

Lemma 3.1

In inner step $i$ of GRPO, for any threshold $H>0$, there exists a lower bound $P_i(H)\in(0,1)$ such that [...]. Moreover, under the common tail-envelope condition on $\Delta\eta_{i,j}$, the bound $P_i(H)$ can be chosen as the same nondecreasing function of $D_i$ for all inner steps (see the appendix proof of the GRPO instability loop).
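The monotonic dependence on drift can be made concrete with a toy model. This is an illustration under assumptions chosen here, not the paper's definitions: the log-ratio is modeled as a deterministic drift plus zero-mean Gaussian noise, and `drift`, `sigma`, and `spike_probability` are hypothetical names. Under that model the probability that the ratio exceeds a threshold has a closed form and is nondecreasing in the drift.

```python
import math

def spike_probability(drift, threshold, sigma=1.0):
    """P(exp(drift + sigma * z) > threshold) for z ~ N(0, 1) (toy model)."""
    # The ratio exceeds the threshold when z > (log(threshold) - drift) / sigma.
    z = (math.log(threshold) - drift) / sigma
    # Gaussian upper-tail probability via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Exceedance probability at threshold 5 for increasing drift values.
for d in [0.0, 0.5, 1.0, 2.0]:
    print(d, spike_probability(d, threshold=5.0))
```

As the drift grows, the same noise level crosses the threshold more often, which is the qualitative behavior the lemma's drift-dependent bound describes.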

Figures (8)

  • Figure 1: StableDRL is the first method to enable stable full-parameter RL training on both full-attention and block dLLMs, better unlocking reasoning capability for dLLMs. The left panel reports performance on full-attention dLLMs (LLaDA-8B [nie2025llada]). Based on Table 1, Best Prior corresponds to WD1 [tang2025wd1], and Best SOTA corresponds to the better of ESPO and SPG [ou2025espo, wang2025spg] for each task. The right panel shows results for block diffusion models (SDAR-8B [cheng2025sdar]).
  • Figure 2: (a) Training instability. Naive integration of noisy importance ratios into GRPO leads to severe instability under full-parameter RL training with dLLMs. Notably, reward collapse occurs even with Policy Gradient, where the importance ratio is fixed at 1. (b) Instability loop. Estimation noise triggers gradient spikes and policy drift, creating a self-reinforcing cycle that amplifies the variance of future importance ratios. (c) StableDRL. To address this, we propose a reformulated GRPO for noisy importance ratios. By employing unconditional clipping and self-normalization, StableDRL effectively breaks the instability loop.
  • Figure 3: Staircase Attention for Efficient Proxy Estimation. To evaluate the ELBO for block diffusion in a single pass ($O(1)$), we use a dual-stream construction. The Clean Context (top rows) provides immutable history. The Corrupted Target stream (bottom rows) uses a "staircase" mask ($M_{\textsc{stair}}$, bottom-left) to attend to valid history without peeking at the ground truth of the current block. The target self-attention ($M_{\textsc{intra}}$, bottom-right) is block-diagonal, ensuring independent parallel denoising.
  • Figure 4: Verification of Instability Mechanisms across Methods. Left Column (GRPO): Unbounded drift (bottom) fuels an accelerating spike rate (middle), causing reward collapse (top). Middle Column (Unconditional Clipping): Clipping saturates the drift (bottom) but induces a high-frequency, stochastic spike regime (middle) that destabilizes learning (top). Right Column (StableDRL): Our method maintains a low, stable spike rate (middle) decoupled from drift (bottom), resulting in monotonic reward improvement (top).
  • Figure 5: Robustness to Proxy Noise: The "Exploding Weight" Stress Test (GSM8K). We compare training stability under standard conditions ("Normal", solid lines) versus an adversarial regime where importance weight variance is artificially amplified ("Exploding", dashed lines; see the stress-test appendix). (Left) Reward Trajectories: StableDRL (Green) demonstrates invariant stability, maintaining monotonic improvement in both regimes. In contrast, ESPO (Orange) suffers immediate, noise-accelerated collapse, confirming its sensitivity to ratio outliers. SPG (Blue) degrades in both settings, indicating that avoiding ratios (to reduce variance) fatally exposes the model to off-policy bias. (Right) Gradient Norm Density: Visualizing the failure mechanism. StableDRL maintains a condensed, low-variance gradient distribution. Conversely, ESPO exhibits a heavy right tail of explosive updates (log-norm $>3$), confirming that the "Asymmetric Clipping Failure" allows noise spikes to propagate unchecked.
  • ...and 3 more figures
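The dual-stream mask described in the Figure 3 caption can be sketched as a boolean matrix over the concatenated sequence [clean context ; corrupted target]. This is a minimal reconstruction from the caption alone. The function name `staircase_mask` is hypothetical, and the top-left quadrant (clean tokens attending block-causally to clean tokens) is an assumption, since the caption only states that the clean context is immutable history. The bottom-left quadrant plays the role of $M_{\textsc{stair}}$ (strictly earlier clean blocks only) and the bottom-right quadrant the role of $M_{\textsc{intra}}$ (block-diagonal).

```python
import numpy as np

def staircase_mask(num_blocks, block_size):
    """Boolean attention mask (True = may attend) of shape (2n, 2n), n = blocks * size."""
    n = num_blocks * block_size
    blk = np.arange(n) // block_size           # block index of each position
    q, k = blk[:, None], blk[None, :]          # query and key block indices

    m_clean = k <= q    # assumption: clean stream is block-causal
    m_stair = k < q     # target block b sees clean blocks strictly before b
    m_intra = k == q    # target block b sees only its own corrupted block

    no_access = np.zeros((n, n), dtype=bool)   # clean rows never see corrupted tokens
    top = np.concatenate([m_clean, no_access], axis=1)
    bottom = np.concatenate([m_stair, m_intra], axis=1)
    return np.concatenate([top, bottom], axis=0)

mask = staircase_mask(num_blocks=3, block_size=2)
print(mask.astype(int))
```

Because every corrupted block has its own row band in one mask, all blocks can be scored in a single forward pass, at the cost of doubling the sequence length.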

Theorems & Definitions (31)

  • Lemma 3.1: Informal, existence of drift-dependent spike probability
  • Theorem 3.2: Informal, self-reinforcing instability loop
  • Lemma 3.3: Informal, existence of hitting probability
  • Theorem 3.4: Informal, self-reinforcing hitting loop
  • Theorem 3.5: StableDRL
  • Theorem B.1: GRPO drift--spike feedback loop
  • Theorem B.2: Boundary saturation under two-sided clipping
  • Theorem B.3: Self-normalization removes the random group-scale factor
  • Remark B.4
  • Lemma B.5: Ratio exceedance identity and drift monotonicity
  • ...and 21 more