Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

Luckeciano C. Melo, Alessandro Abate, Yarin Gal

TL;DR

The paper tackles instability and sample inefficiency in policy-gradient RL for LLM-based reasoning. It introduces CAPO, a curvature-aware data-selection method that relies on a tractable last-layer curvature model to anticipate unstable updates and enforce a local trust region. The authors provide monotonic-improvement guarantees under practical assumptions and demonstrate up to 30× improvements in sample efficiency on math-reasoning benchmarks with minimal token rejection and overhead. This approach offers a scalable path to more reliable, efficient RL fine-tuning of LLMs for complex reasoning tasks.

Abstract

Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains understudied. As a result, existing implementations often resort to conservative hyperparameter choices to ensure stability, which requires more training samples and increases computational costs. Hence, developing models for reliably tracking the underlying optimization dynamics and leveraging them into training enables more sample-efficient regimes and further unleashes scalable post-training. We address this gap by formalizing the stochastic optimization problem of policy gradients with explicit consideration of second-order geometry. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. We further employ this framework to design interventions in the optimization process through data selection. The resultant algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out. Theoretically, we establish monotonic improvement guarantees under realistic assumptions. On standard math reasoning benchmarks, we empirically show that CAPO ensures stable updates under aggressive learning regimes where baselines catastrophically fail. With minimal intervention (rejecting fewer than 8% of tokens), CAPO achieves up to 30x improvement in sample efficiency over standard GRPO for LLM reasoning.
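The masking idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a last-layer softmax model `softmax(W h)`, treats each token as its own candidate subset with a plain gradient step, and checks only a Fisher-based policy-shift bound (the threshold name `delta_F` is borrowed from Theorem 5.1; the objective-curvature check tied to `delta_H` is omitted). The function name `fisher_shift_mask` and all argument names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fisher_shift_mask(W, H, actions, advantages, lr, delta_F):
    """Hypothetical token filter based on a last-layer Fisher quadratic form.

    W: (V, d) last-layer weights, H: (T, d) hidden states,
    actions: (T,) sampled token ids, advantages: (T,) per-token advantages.
    Returns (keep, shift): a boolean mask and the estimated policy shift.
    """
    P = softmax(H @ W.T)                        # (T, V) token distributions
    onehot = np.eye(W.shape[0])[actions]        # (T, V)
    # Advantage-weighted gradient of log pi(a|h) with respect to the logits.
    G = advantages[:, None] * (onehot - P)
    # Candidate step dW_i = lr * G_i h_i^T moves token i's own logits by
    # dW_i h_i = lr * G_i * ||h_i||^2.
    dz = lr * G * (H ** 2).sum(axis=1, keepdims=True)
    # Quadratic form dz^T (diag(p) - p p^T) dz, i.e. the variance of dz under p.
    quad = (P * dz ** 2).sum(axis=1) - (P * dz).sum(axis=1) ** 2
    shift = 0.5 * quad                          # second-order KL estimate
    return shift <= delta_F, shift

# Toy usage: uniform policy (W = 0), one small and one large hidden state.
W = np.zeros((4, 2))
H = np.array([[1.0, 0.0], [10.0, 0.0]])
keep, shift = fisher_shift_mask(W, H, np.array([0, 2]), np.ones(2),
                                lr=0.1, delta_F=0.01)
```

In this toy case the token with the large hidden-state norm would induce a much larger estimated shift and is masked out, while the other token is kept. A real pipeline would multiply the policy-gradient loss by `keep` before averaging.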

Paper Structure

This paper contains 21 sections, 11 theorems, 85 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Theorem 5.1

Fix thresholds $\delta_H>0$ and $\delta_F>0$. Let $\mathcal{B}$ be a batch of sampled trajectories. Split $\mathcal{B}$ into $N$ disjoint subsets $b_{i} \subset \mathcal{B}$, and propose candidate subset updates $\{\Delta\theta_i\}_{i=1}^{N}$. Retain those satisfying: with $\omega > 0$ and $M$, $r$ defined as in Assumption ass:curvature-and-steps. Let $\mathcal{B}_{acc}$ denote the superset of the B

Figures (8)

  • Figure 1: Accuracy on MATH dataset from different RL methods. CAPO (ours) achieves $30\times$ greater sample efficiency under an aggressive (A) update regime (higher learning rate, smaller batch size), whereas GRPO suffers policy collapse.
  • Figure 2: Comparison with baseline methods on policy gradient stability. While the setup with more aggressive updates makes all methods more sample-efficient, it also leads the baselines to policy collapse. In contrast, CAPO prevents collapse and achieves up to $30\times$ greater sample efficiency than GRPO under aggressive updates.
  • Figure 3: Evaluation of policy and objective shift estimates from the proposed computational model during training. Unstable methods exhibit large and abrupt directional curvatures, while stable ones maintain much smaller and smoother shifts. CAPO, by applying token-level bounds, also ensures well-behaved shifts at the global (batch) level, supporting the rationale of Theorem 5.1.
  • Figure 4: Evaluation of extended versions of RL methods with curvature-aware selection. Incorporating curvature-aware selection consistently improves the base methods, preventing policy collapse and demonstrating the broader applicability of our approach across different policy optimization objectives.
  • Figure 5: Token rejection rate under CAPO. It maintains a low rejection rate over training, stabilizing learning with minimal intervention.
  • ...and 3 more figures

Theorems & Definitions (21)

  • Theorem 5.1: Monotonic improvement under CAPO
  • Proposition A.1: Second-order expansion with integral remainder
  • proof
  • Lemma B.1: The grad-log-prob identity
  • proof
  • Lemma B.2: Fisher identity
  • proof
  • Proposition B.1: Second-order expansion with integral remainder
  • proof
  • Proposition C.1: Gradient w.r.t. the last-layer model of a softmax policy
  • ...and 11 more
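
For orientation, the standard results that the names of Lemma B.1, Lemma B.2, and Proposition C.1 usually refer to are sketched below. The paper's exact statements are not reproduced in this listing, so the correspondence is an assumption; the notation ($\pi_\theta$, $F(\theta)$, $W$, $h$, $e_a$) is chosen here for illustration.

$$
\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s),
\qquad
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0
$$

$$
F(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s)^{\top}\big]
= -\,\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta^{2} \log \pi_\theta(a \mid s)\big]
$$

$$
\pi_W(\cdot \mid h) = \mathrm{softmax}(W h)
\;\;\Longrightarrow\;\;
\nabla_W \log \pi_W(a \mid h) = \big(e_a - \pi_W(\cdot \mid h)\big)\,h^{\top}
$$

Here $e_a$ is the one-hot vector for token $a$ and $h$ is the final hidden state. The last identity is what makes a last-layer curvature model cheap to evaluate: the per-token gradient is a rank-one outer product.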