Policy Improvement Reinforcement Learning

Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.
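The closed-loop idea can be sketched in a few lines. The toy objective, window size, and amplification/rectification factors below are hypothetical placeholders rather than the paper's actual PIPO rules (those are given by the policy-improvement reward of Definition 5.1 and the rectification result of Theorem 6.4); the sketch only illustrates verifying each update against a sliding-window historical baseline and rescaling it by the sign of the measured improvement.

```python
# Hedged sketch (not the paper's implementation): closed-loop updates that
# are verified retrospectively against a sliding-window baseline.
from collections import deque
import numpy as np

rng = np.random.default_rng(0)

def evaluate(theta):
    """Hypothetical stand-in for measuring verifiable-task success (higher is better)."""
    return float(-np.sum(theta ** 2) + 0.01 * rng.standard_normal())

theta = rng.standard_normal(4)
window = deque(maxlen=8)   # sliding-window historical baseline
lr = 0.05

for step in range(100):
    # Standard (open-loop) step driven by the instantaneous reward signal.
    grad = -2.0 * theta                    # gradient of the toy objective
    prev_theta = theta.copy()
    theta = theta + lr * grad

    # Retrospective verification: compare post-update performance with the
    # sliding-window baseline of recent iterations.
    j_now = evaluate(theta)
    baseline = float(np.mean(window)) if window else j_now
    delta_j = j_now - baseline             # policy-improvement signal

    # Closed-loop regulation: reinforce updates that improved the policy,
    # partially undo (rectify) those that did not.
    update = theta - prev_theta
    if delta_j > 0:
        theta = theta + 0.5 * update       # amplify the beneficial update
    else:
        theta = prev_theta + 0.5 * update  # suppress the harmful update

    window.append(j_now)
```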

Paper Structure

This paper contains 43 sections, 16 theorems, 70 equations, 4 figures, 5 tables, and 1 algorithm.

Key Result

Theorem 3.1 (Gradient Distortion)

Under the above conditions, the expected GRPO update direction satisfies $\mathbb{E}\big[g_{\mathrm{GRPO}}\big] = \eta(p_t)\, g_{\mathrm{ideal}}$, where $g_{\mathrm{ideal}} = \nabla_\theta J_{\mathrm{RLVR}}(\theta)$ and $\eta(p_t)$ is a state-dependent scaling factor determined by the policy success rate $p_t$. Thus, GRPO applies a state-dependent rescaling to the ideal gradient, indicating that it does not perform unbiased gradient ascent on the RLVR objective.
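To see where such a distortion can come from, here is a hedged back-of-the-envelope calculation assuming binary verifiable rewards and group-standardized advantages; the paper's exact expression for $\eta(p_t)$ may differ, but this simple case already reproduces the boundary blow-up shown in Figure 2(a).

```latex
% Sketch under the assumption of binary rewards r_i \in \{0,1\} with
% group success rate p_t and standard-deviation-normalized advantages.
\[
  \hat{A}_i \;=\; \frac{r_i - \bar{r}}{\sigma_r},
  \qquad \bar{r} = p_t,\;\; \sigma_r = \sqrt{p_t(1-p_t)}
  \;\;\Longrightarrow\;\;
  \mathbb{E}\big[g_{\mathrm{GRPO}}\big]
  \;\approx\; \frac{1}{\sqrt{p_t(1-p_t)}}\; g_{\mathrm{ideal}},
\]
% i.e., an effective scaling \eta(p_t) \propto 1/\sqrt{p_t(1-p_t)},
% which diverges as p_t -> 0 or p_t -> 1.
```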

Figures (4)

  • Figure 1: Overview of Policy Improvement Reinforcement Learning (PIRL) framework. Left: Traditional RLVR methods follow an open-loop paradigm, updating policies from instantaneous rewards without verifying actual improvement. Middle: PIRL introduces a verification stage, forming a closed-loop optimization driven by policy improvement signals. Right: During verification, updates are adaptively regulated: positive signals ($\Delta J > 0$) are amplified, while negative signals ($\Delta J < 0$) trigger rectification to suppress harmful updates and stabilize training.
  • Figure 2: Theoretical distortion and empirical instability of GRPO. (a) Gradient Distortion: The gradient scaling factor $\eta(p_t)$ evaluated across success rates $p_t$. As established in Corollary 3.2 (Sensitivity Explosion), GRPO with group sizes $G = 8$ and $G = 128$ exhibits severe sensitivity explosion at the boundaries ($p_t \to 0, 1$). (b) Empirical Stability: Standard GRPO suffers from drastic gradient-norm spikes (left) and severe Pass@1 collapse (right). Incorporating PIPO effectively stabilizes training.
  • Figure 3: Comparison of training dynamics. (a-b) Accuracy evolution on 4B and 8B models. (c-d) Response length evolution. PIPO (solid lines) consistently outperforms baselines (dashed lines) and promotes sustained reasoning chain growth.
  • Figure 4: Efficiency on the MATH dataset. (a) Step Overhead: PIPO incurs a 12–19% per-step latency increase. (b) Wall-clock Efficiency: Despite lower throughput, PIPO reaches higher accuracy faster than GRPO.

Theorems & Definitions (29)

  • Definition 2.1: Policy Success Rate
  • Theorem 3.1: Gradient Distortion
  • Corollary 3.2: Sensitivity Explosion
  • Definition 4.1: Policy Improvement
  • Definition 4.2: PIRL Objective
  • Theorem 4.3: Objective Alignment
  • Proposition 4.4: Objective Consistency
  • Definition 5.1: Policy Improvement Reward
  • Theorem 6.3: PIPO Approximates Smoothed PIRL Ascent
  • Theorem 6.4: Conditional Geometric Rectification
  • ...and 19 more