MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

Hongjun Wang, Wei Liu, Weibo Gu, Xing Sun, Kai Han

Abstract

Regulating the importance ratio is critical for the training stability of frameworks based on Group Relative Policy Optimization (GRPO). However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing-gradient regions, and therefore fail to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. MHPO introduces a Log-Fidelity Modulator (LFM) that maps unbounded importance ratios into a bounded, differentiable domain. This mechanism prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, MHPO achieves fine-grained regulation of asymmetric policy shifts, simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods while significantly enhancing training stability.

Paper Structure (18 sections, 1 theorem, 11 equations, 3 figures, 4 tables, 1 algorithm)

Key Result

Theorem 1

Assume the group-relative advantage has variance $\mathbb{E}[(\hat{A}^{i}_t)^2]\le \sigma_A^2$, and the score function satisfies $\mathbb{E}[\|\nabla_\theta \log \pi_\theta(q_t^i \mid p, q_{<t}^i)\|^2]\le G^2$. The mini-batch gradient estimator $g(\theta)=\frac{1}{K}\sum_{i=1}^{K}\sum_{t=1}^{T_i}\ma

Figures (3)

  • Figure 1: Overview of MHPO. (a) Performance gain over baseline (Avg@32, %) on Qwen3-4B-Base. The bar chart displays the improvement ($\Delta$) achieved by different RL methods across five benchmarks. MHPO (navy) consistently achieves the largest gain on every benchmark. (b) Gradient norm trajectory during RL training. Baseline methods exhibit frequent gradient spikes, whereas MHPO maintains consistently low and stable gradient magnitudes throughout training, empirically confirming the bounded gradient multiplier guaranteed by our theoretical analysis. (c) Overall reward curve during training. MHPO attains higher reward earlier and sustains this advantage, while competing methods plateau or exhibit degradation in later stages.
  • Figure 2: Characteristics of the Log-Fidelity Modulator (LFM). (a) The LFM operator $\psi(r_t^i(\theta))$ with different values of $c$. The transformation exhibits reciprocal antisymmetry ($\psi(1/r_t^i(\theta)) = -\psi(r_t^i(\theta))$) around the on-policy anchor ($r_t^i(\theta)=1$) and smoothly bounds the output within the interval $[-c, c]$. The marked points at $r_t^i(\theta) = \tfrac{1}{2}$ and $r_t^i(\theta) = 2$ (for $c=1.0$) confirm $\psi(\tfrac{1}{2}) = -\psi(2)$. (b) The derivative of the LFM operator, demonstrating smooth gradient attenuation for extreme importance ratios while preserving local fidelity ($d\psi/dr = 1$) near the on-policy anchor.
  • Figure 3: Analysis of the Decoupled Hazard Penalty (DHP). (a) The penalty $\zeta(\psi)$ with different $(\lambda, k)$ configurations. The default asymmetric setting ($\lambda_+=1.0, k_+=1.5, \lambda_-=0.8, k_-=2.0$) applies stronger suppression for negative shifts. (b) The corresponding survival weight $w = \exp(-\zeta) \in (0, 1]$, demonstrating that the DHP can only attenuate and never amplify token contributions.
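
The properties stated in the captions of Figures 2 and 3 can be checked numerically with a small sketch. The closed forms below are assumptions chosen only because they satisfy the stated properties; this summary does not give the paper's actual definitions. The sketch assumes $\psi(r) = c\,\tanh(\log r / c)$ for the LFM, which is bounded by $c$, reciprocally antisymmetric, and has unit slope at $r = 1$. For the DHP it assumes a Weibull-style cumulative hazard $\zeta(\psi) = (|\psi|/\lambda_\pm)^{k_\pm}$, with the side chosen by the sign of $\psi$. The function names are hypothetical.

```python
import math


def lfm(r, c=1.0):
    """ASSUMED LFM form (not confirmed by the source): psi(r) = c * tanh(log(r) / c)."""
    if r <= 0.0:
        raise ValueError("importance ratio must be positive")
    return c * math.tanh(math.log(r) / c)


def lfm_grad(r, c=1.0):
    """Derivative d(psi)/dr of the assumed form: sech^2(log(r) / c) / r."""
    u = math.log(r) / c
    return (1.0 - math.tanh(u) ** 2) / r


def dhp_penalty(psi, lam_pos=1.0, k_pos=1.5, lam_neg=0.8, k_neg=2.0):
    """ASSUMED Weibull-style cumulative hazard, applied separately to each sign of psi.

    The default (lambda, k) values are the asymmetric setting quoted in Figure 3.
    """
    if psi >= 0.0:
        return (psi / lam_pos) ** k_pos
    return (-psi / lam_neg) ** k_neg


def survival_weight(psi, **hazard_params):
    """Survival weight w = exp(-zeta); lies in (0, 1], so it can only attenuate."""
    return math.exp(-dhp_penalty(psi, **hazard_params))


if __name__ == "__main__":
    for r in (0.25, 0.5, 1.0, 2.0, 4.0):
        psi = lfm(r)
        print(f"r={r:5.2f}  psi={psi:+.4f}  dpsi/dr={lfm_grad(r):.4f}  "
              f"w={survival_weight(psi):.4f}")
```

Under these assumed forms, the weight equals 1 at the on-policy anchor $r = 1$ and decays smoothly on either side, with separate $(\lambda, k)$ pairs controlling how fast it decays for positive and negative shifts.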

Theorems & Definitions (1)

  • Theorem 1: Second-Moment Stability of the Mini-batch Gradient Estimator