Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
TL;DR
This paper tackles the challenge of sparse reward signals and unstable training in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. It introduces PACS, a framework that recasts RLVR as a supervised learning problem by training a policy-parameterized score function $\psi(q,o;\pi_\theta)$ via cross-entropy, and thereby achieves implicit actor–critic coupling through shared parameters. A gradient analysis shows that the approach recovers standard policy gradient updates while providing lower-variance, more stable training signals, and a score function instantiated via REINFORCE Leave-One-Out further stabilizes learning. Empirically, PACS consistently outperforms strong RLVR baselines and open-source models on math reasoning benchmarks, enhances solution diversity, and generalizes to out-of-domain tasks, demonstrating practical impact for scalable reasoning with verifiable rewards. The work also presents theoretical insights into entropy preservation via implicit Gibbs regularization and provides open-source code for reproducibility.
Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges: existing methods often suffer from sparse reward signals and unstable policy gradient updates, both of which are inherent to RL-based approaches. To address these challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while providing more stable and efficient training. Extensive experiments demonstrate that PACS significantly outperforms strong open-source models and RLVR baselines, yielding substantial average gains of $\textbf{+8.26\%}$ (4B) and $\textbf{+9.57\%}$ (8B) over base models, offering a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
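To make the supervised reformulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary verifiable reward per sampled response and a hypothetical per-response quantity `raw_scores` (for example, a scaled sequence log-probability under the policy); the exact form of the score function is not given in this section. The leave-one-out step mirrors the REINFORCE Leave-One-Out idea mentioned in the TL;DR, and all function names here are illustrative.

```python
import math


def leave_one_out_scores(raw_scores):
    """Center each raw score by the mean of the OTHER samples' raw scores.

    This is a leave-one-out baseline over responses sampled for one query.
    `raw_scores` is a hypothetical stand-in for a policy-derived quantity.
    """
    n = len(raw_scores)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(raw_scores)
    return [s - (total - s) / (n - 1) for s in raw_scores]


def bce_with_logits(score, label):
    """Numerically stable cross-entropy between sigmoid(score) and a 0/1 label."""
    return max(score, 0.0) - score * label + math.log1p(math.exp(-abs(score)))


def supervised_rlvr_loss(raw_scores, rewards):
    """Mean cross-entropy that treats each verifiable reward as a label.

    Assumption: rewards are 0 (incorrect) or 1 (correct) for each response.
    """
    scores = leave_one_out_scores(raw_scores)
    losses = [bce_with_logits(s, r) for s, r in zip(scores, rewards)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two responses to one query: the first is verified correct, the second is not.
    print(supervised_rlvr_loss([1.0, -1.0], [1, 0]))
```

In this sketch the loss is small when responses with reward 1 receive higher centered scores than responses with reward 0, and it grows when the ordering is reversed. In an actual training setup the raw scores would be differentiable functions of the policy parameters, so minimizing this loss would update the policy directly.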
