Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification

Tianyi Wang, Long Li, Hongcan Guo, Yibiao Chen, Yixia Li, Yong Wang, Yun Chen, Guanhua Chen

TL;DR

The paper identifies Recursive Space Contraction as a core failure mode in RLVR where on-policy pruning collapses valid reasoning paths. It proposes Anchored Policy Optimization (APO), which replaces global Shape Matching with Support Coverage on a Safe Manifold, and introduces Ratio Rectification to align gradients and enable Elastic Recovery. Theoretical results show APO gradients maximize safe-manifold mass during error correction, while empirical results across multiple models and five math benchmarks show improved Pass@1 without sacrificing Pass@K diversity, effectively breaking the usual accuracy-diversity trade-off. This approach offers a principled, geometry-aware alternative to KL regularization, with practical impact for robust reasoning in large language models.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
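
Because the abstract states its results in terms of Pass@1 and Pass@K, a short sketch of how such metrics are commonly estimated may help. The code below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n sampled solutions of which c are correct. Whether the paper uses exactly this estimator is an assumption; the function name `pass_at_k` and the toy counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumption: this is the standard combinatorial estimator; the excerpt
    does not say which estimator the paper itself uses.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


def mean_pass_at_k(counts, n: int, k: int) -> float:
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)


# Toy illustration (made-up counts): two policies, two problems, 8 samples each.
collapsed = [8, 0]  # always solves problem 1, never solves problem 2
diverse = [4, 4]    # solves each problem half of the time

for name, counts in (("collapsed", collapsed), ("diverse", diverse)):
    print(name, mean_pass_at_k(counts, n=8, k=1), mean_pass_at_k(counts, n=8, k=8))
```

In this toy setup both policies have the same Pass@1, but only the diverse one reaches full Pass@8, which is the kind of gap the accuracy-diversity trade-off refers to.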

Paper Structure

This paper contains 46 sections, 3 theorems, 27 equations, 6 figures, 6 tables.

Key Result

Proposition 2.1

In the unclipped regime, the gradient of the APO pull term with respect to a negative advantage is collinear with the gradient of $\mathcal{J}_{\text{support}}$.
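
The excerpt does not define $\mathcal{J}_{\text{support}}$, so the following is only a sketch under stated assumptions: the Safe Manifold is taken to be the top-K tokens under the reference distribution at a single decoding step, and the support objective is taken to be the log of the policy mass on that set. Under those assumptions the gradient with respect to the logits has a closed form (policy renormalized on the anchor set, minus the full policy), which the code computes. The names `anchor_set`, `log_support_mass`, and `grad_log_support_mass` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_set(ref_probs, k):
    """Assumed Safe Manifold: indices of the k most probable reference tokens."""
    order = sorted(range(len(ref_probs)), key=lambda i: ref_probs[i], reverse=True)
    return set(order[:k])


def log_support_mass(logits, anchors):
    """Assumed support objective: log of the policy mass inside the anchor set."""
    probs = softmax(logits)
    return math.log(sum(probs[i] for i in anchors))


def grad_log_support_mass(logits, anchors):
    """Gradient of log_support_mass with respect to the logits.

    For token j: 1[j in anchors] * pi_j / mass - pi_j.
    """
    probs = softmax(logits)
    mass = sum(probs[i] for i in anchors)
    return [(p / mass if j in anchors else 0.0) - p for j, p in enumerate(probs)]


# Toy step: the policy has drifted toward token 0 and away from anchor token 1.
ref_probs = [0.5, 0.3, 0.15, 0.05]
policy_logits = [3.0, -2.0, 0.5, 0.0]
anchors = anchor_set(ref_probs, k=2)

grad = grad_log_support_mass(policy_logits, anchors)
stepped = [z + 0.5 * g for z, g in zip(policy_logits, grad)]
print(log_support_mass(policy_logits, anchors), log_support_mass(stepped, anchors))
```

Every anchor token receives a positive logit gradient and every non-anchor token a negative one, so an ascent step moves mass back into the assumed Safe Manifold without prescribing how that mass is shared among the anchors.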

Figures (6)

  • Figure 1: Geometric Interpretation of Regularization Paradigms. (a) Recursive Space Contraction (Vanilla PG): The interplay of positive sharpening and blind negative squeezing drives probability mass to collapse into a single narrow path, permanently discarding valid support. (b) Shape Matching (KL Regularization): KL imposes a rigid constraint (visualized as springs) that forces $\pi_{\theta}$ to mimic the exact density profile of $\pi_{\text{ref}}$, prohibiting the local sharpening necessary for high Pass@1 efficiency. (c) Support Coverage (APO, Ours): Our method maximizes mass coverage within a Safe Manifold ($\mathcal{M}_{\text{safe}}$). This permits aggressive sharpening for accuracy, while the anchor force provides Elastic Recovery during error correction to prevent leakage into invalid regions.
  • Figure 2: Geometric Interpretation of Gradient Dynamics. The x and y axes represent the directions of reward maximization and manifold consistency, respectively. (a) GRPO + KL (Baseline): The KL penalty gradient ($\nabla \mathcal{L}_{\mathrm{KL}}$, Red) conflicts with the reward gradient ($\nabla \mathcal{L}_{\mathrm{GRPO}}$, Black), driving the resultant vector (Purple) to breach the Trust Region. (b) APO (Ours): APO produces a unified gradient ($\nabla \mathcal{L}_{\mathrm{APO}}$, Blue) intrinsically aligned with the Safe Manifold, ensuring stable and efficient updates.
  • Figure 3: Comparison of Error Correction Dynamics. (Left) Irreversible Pruning: In Standard RLVR, the Squeezing Effect causes blind mass redistribution upon error, permanently collapsing alternative branches (greyed out). (Right) Elastic Recovery via APO: When the sharpened path hits an error (Red X), the Pull Term in our rectified ratio acts as a restoring force. It re-activates the Anchor Set (glowing blue nodes) derived from the reference priors, allowing the agent to "backtrack" and explore valid alternative paths (solid blue arrow) that were previously suppressed.
  • Figure 4: Efficiency-Diversity Pareto Frontier. Comparison of APO against baselines on Qwen2.5-Math-7B. The dashed red arrow illustrates the Diversity Collapse phenomenon observed in standard GRPO, where efficiency gains come at the cost of coverage. The solid blue arrow highlights APO's Pareto Improvement, achieving state-of-the-art Pass@1 efficiency while simultaneously restoring the diverse support lost during RL training.
  • Figure 5: Hyperparameter Sensitivity Analysis. We visualize the impact of the Pull coefficient $\beta$ (Left), Push coefficient $\lambda$ (Middle), and Anchor Size $K$ (Right) on Pass@1 performance. The Red Star denotes our default configuration. The inverted-U trends in the middle and right plots confirm that APO operates at an optimal trade-off, where the regularization is strong enough to correct errors but compliant enough to permit valid sharpening.
  • ...and 1 more figure

Theorems & Definitions (6)

  • Proposition 2.1: Gradient Alignment
  • proof
  • Proposition 2.2: Passive Suppression via Normalization
  • proof
  • Proposition 2.3: Vanishing Recovery under Proportional Redistribution
  • proof
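
The statements of Propositions 2.2 and 2.3 are not reproduced in this excerpt, so the following is only one possible reading of their titles, not the paper's proof. It illustrates a standard property of softmax policies: a gradient step that penalizes one sampled token (negative advantage) raises every other logit in proportion to its current probability, so the dominant token absorbs most of the freed mass while low-probability alternatives can lose mass even though they were never sampled. The function `negative_update` and the toy numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_update(logits, sampled, lr=1.0, advantage=-1.0):
    """One gradient-ascent step on advantage * log pi(sampled) w.r.t. the logits.

    Uses d log pi(sampled) / d z_j = 1[j == sampled] - pi_j.
    """
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if j == sampled else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]


# Toy distribution: one sharpened token, one sampled-but-wrong token,
# and two valid low-probability alternatives that were not sampled.
before = [0.70, 0.20, 0.08, 0.02]
logits = [math.log(p) for p in before]

after = softmax(negative_update(logits, sampled=1))
for j, (b, a) in enumerate(zip(before, after)):
    print(f"token {j}: {b:.3f} -> {a:.3f}")
```

In this toy run the penalized token and both unsampled alternatives lose probability while the already dominant token gains, which is the kind of blind redistribution the Figure 3 caption attributes to the Squeezing Effect.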