Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification
Tianyi Wang, Long Li, Hongcan Guo, Yibiao Chen, Yixia Li, Yong Wang, Yun Chen, Guanhua Chen
TL;DR
The paper identifies Recursive Space Contraction as a core failure mode in RLVR where on-policy pruning collapses valid reasoning paths. It proposes Anchored Policy Optimization (APO), which replaces global Shape Matching with Support Coverage on a Safe Manifold, and introduces Ratio Rectification to align gradients and enable Elastic Recovery. Theoretical results show APO gradients maximize safe-manifold mass during error correction, while empirical results across multiple models and five math benchmarks show improved Pass@1 without sacrificing Pass@K diversity, effectively breaking the usual accuracy-diversity trade-off. This approach offers a principled, geometry-aware alternative to KL regularization, with practical impact for robust reasoning in large language models.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
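The contrast between Shape Matching and Support Coverage can be illustrated with a toy next-token distribution. The following is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify how the Safe Manifold is constructed, so it is approximated here as the top-k tokens of the reference distribution, and the function names (`safe_manifold`, `support_coverage`) and logits are hypothetical. It shows that a policy can sharpen sharply onto one token while keeping almost all of its mass inside the manifold, yet still incur a sizeable KL penalty against the reference.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def safe_manifold(ref_probs, k):
    """Assumed stand-in for the Safe Manifold: the indices of the k most
    probable tokens under the reference model."""
    order = sorted(range(len(ref_probs)), key=lambda i: ref_probs[i], reverse=True)
    return set(order[:k])


def support_coverage(policy_probs, manifold):
    """Total policy probability mass that falls inside the manifold."""
    return sum(policy_probs[i] for i in manifold)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy reference model: three plausible tokens and two unlikely ones.
ref = softmax([2.0, 1.5, 1.0, -3.0, -4.0])

# Policy after sharpening onto token 0, which lies inside the manifold.
sharp = softmax([6.0, 1.5, 1.0, -3.0, -4.0])

manifold = safe_manifold(ref, k=3)
coverage = support_coverage(sharp, manifold)
kl_penalty = kl_divergence(sharp, ref)

print(f"manifold indices: {sorted(manifold)}")
print(f"support coverage of sharpened policy: {coverage:.4f}")
print(f"KL(sharp || ref): {kl_penalty:.4f}")
```

In this toy setting a coverage-style objective has nothing to penalize, because the sharpened policy stays on the reference model's high-confidence support, whereas the KL term grows simply because the shape of the distribution changed.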
