Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang

TL;DR

This paper tackles sparse reward signals and unstable training in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. It introduces PACS, a framework that recasts RLVR as a supervised learning problem by training a policy-parameterized score function $\psi(q,o;\pi_\theta)$ with a cross-entropy loss, and thereby achieves implicit actor–critic coupling through shared parameters. A gradient analysis shows that the approach recovers standard policy gradient updates while providing lower-variance, more stable training signals, and a score function instantiated via REINFORCE Leave-One-Out (RLOO) further stabilizes learning. Empirically, PACS consistently outperforms strong RLVR baselines and strong open-source models on math reasoning benchmarks, enhances solution diversity, and generalizes to out-of-domain tasks. The work also presents theoretical insights into entropy preservation via implicit Gibbs regularization and provides open-source code for reproducibility.

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, both inherent to RL-based approaches. To address these challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while providing more stable and efficient training. Extensive experiments demonstrate that PACS significantly outperforms strong open-source models and RLVR baselines, yielding substantial average gains of $\textbf{+8.26\%}$ (4B) and $\textbf{+9.57\%}$ (8B) over base models, offering a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.

Paper Structure

This paper contains 37 sections, 2 theorems, 27 equations, 6 figures, 13 tables.

Key Result

Proposition A.1

The gradient of the PACS objective function $\mathcal{L}_{\text{PACS}}(\theta)$ with respect to the policy parameters $\theta$ can be formulated as a policy gradient update weighted by an effective advantage function derived from the prediction error and the cross-entropy loss.
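
As a rough illustration of where the two ingredients named in the proposition (the cross-entropy loss and the prediction error) can come from, the following is a sketch under stated assumptions, not the paper's exact statement. It assumes that outputs are sampled from the current policy, that the per-sample loss $\ell$ is the binary cross-entropy between $\sigma(\psi)$ and a binary outcome label, and it writes that label as $R \in \{0,1\}$ (a symbol introduced here for illustration).

$$
\ell(q,o;\theta) = -\big[\,R\log\sigma(\psi) + (1-R)\log\big(1-\sigma(\psi)\big)\big], \qquad \psi = \psi(q,o;\pi_\theta)
$$

$$
\nabla_\theta\, \mathbb{E}_{o\sim\pi_\theta(\cdot\mid q)}\big[\ell(q,o;\theta)\big]
= \mathbb{E}_{o\sim\pi_\theta(\cdot\mid q)}\big[\,\ell(q,o;\theta)\,\nabla_\theta\log\pi_\theta(o\mid q)\big]
+ \mathbb{E}_{o\sim\pi_\theta(\cdot\mid q)}\big[\nabla_\theta\,\ell(q,o;\theta)\big]
$$

$$
\nabla_\theta\,\ell(q,o;\theta) = \big(\sigma(\psi) - R\big)\,\nabla_\theta\psi
$$

The first term weights the score function $\nabla_\theta\log\pi_\theta$ by the cross-entropy loss itself, and the second is driven by the prediction error $\sigma(\psi) - R$. If $\psi$ depends on $\theta$ through log-probabilities of the policy, $\nabla_\theta\psi$ also reduces to score-function terms, which is the sense in which a policy-gradient-style update appears. The precise effective advantage is the one given in the paper's Proposition A.1.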

Figures (6)

  • Figure 1: Comparison between RLVR and the supervised learning reformulation, where the query and output are input, and the outcome reward is treated as a predictable label.
  • Figure 2: An illustration of the PACS framework. The framework consists of three main components: (1) Reward Proxy Computation, which calculates a reward proxy $\hat{r}$ based on the log-probability ratio. (2) Group Computation, which computes RLOO-based advantage scores $\psi$ from the reward proxies. (3) Cross-Entropy Loss, which converts the RLVR problem into a supervised learning task, optimizing a scoring function parameterized by the policy with a cross-entropy loss.
  • Figure 3: Training dynamics of Qwen3-8B. The curves illustrate the evolution of $\text{pass}@1$ on math benchmarks throughout the training process.
  • Figure 4: Performance analysis of PACS with varying $\beta$. The 3D heatmaps show $\text{pass@}k$ scores for different combinations of $\beta$ values (0.1, 0.5, 1, 2, 10) and $k$ values on AMC23, AIME-2024, AIME-2025 and BeyondAIME.
  • Figure 5: Exploration and Diversity Analysis. (a) Entropy loss dynamics for Qwen3-4B (top) and 8B (bottom). Unlike baselines that suffer from entropy collapse, PACS maintains higher entropy, enabling sustained exploration. (b) Centered PCA projection of correct solutions for sampled problems (ID shown in bottom-right). The broader semantic coverage of PACS (Blue) compared to GRPO (Red) visually confirms superior diversity.
  • ...and 1 more figures
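
The three components listed in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' implementation: it assumes the reward proxy is $\hat{r} = \beta\,(\log\pi_\theta(o\mid q) - \log\pi_{\text{ref}}(o\mid q))$ computed on sequence-level log-probabilities against a reference policy, that the RLOO score subtracts the mean proxy of the other samples in the group, and that the label is a binary outcome reward. The function names `reward_proxy`, `rloo_scores`, and `bce_with_logits` are hypothetical.

```python
import numpy as np


def reward_proxy(logp_policy, logp_ref, beta=1.0):
    """Assumed reward proxy: beta times the sequence-level log-probability ratio."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    return beta * (logp_policy - logp_ref)


def rloo_scores(r_hat):
    """Leave-one-out score: each proxy minus the mean proxy of the other samples."""
    r_hat = np.asarray(r_hat, dtype=float)
    group_size = r_hat.shape[0]
    if group_size < 2:
        raise ValueError("RLOO needs at least two samples per query")
    loo_mean = (r_hat.sum() - r_hat) / (group_size - 1)
    return r_hat - loo_mean


def bce_with_logits(psi, labels):
    """Mean binary cross-entropy between sigmoid(psi) and 0/1 labels (stable form)."""
    psi = np.asarray(psi, dtype=float)
    labels = np.asarray(labels, dtype=float)
    per_sample = np.maximum(psi, 0.0) - psi * labels + np.log1p(np.exp(-np.abs(psi)))
    return float(per_sample.mean())


# Toy group of four sampled outputs for one query (all numbers are made up).
logp_policy = [-12.0, -15.0, -11.0, -14.0]  # log pi_theta(o_i | q)
logp_ref = [-13.0, -14.0, -13.0, -14.5]     # log pi_ref(o_i | q), assumed reference
labels = [1, 0, 1, 0]                       # verifiable outcome rewards used as labels

r_hat = reward_proxy(logp_policy, logp_ref, beta=1.0)
psi = rloo_scores(r_hat)
loss = bce_with_logits(psi, labels)
```

In a real training loop the log-probabilities would come from an autodiff framework so that the loss can be backpropagated into the policy; plain NumPy is used here only to keep the arithmetic visible. Note that the leave-one-out scores within a group always sum to zero, so the scores express how each output compares with its siblings rather than an absolute value.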

Theorems & Definitions (4)

  • Proposition A.1: Gradient Derivation
  • proof
  • Proposition A.2: Implicit Gibbs Regularization
  • proof