Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data

Yunhao Tang, Sid Wang, Lovish Madaan, Rémi Munos

TL;DR

This work tackles scaling reinforcement learning for language models to unverifiable data by introducing JEPO, a Jensen's lower bound-based policy optimization method that treats chain-of-thought as a latent variable. By employing a multi-sample Jensen lower bound and combining a variance-reduced RL-like update with a supervised loss, JEPO trains effectively without requiring externally verifiable rewards, enabling long-form reasoning tasks such as proofs. The authors derive the theoretical connections to ELBO and RL, provide detailed implementation strategies, and demonstrate competitive or superior performance across verifiable (short-form math), semi-verifiable (Numina with mixed rewards), and unverifiable (Numina-proof) data. This approach broadens the applicability of RL-style training to abundant long-form data, with practical implications for scaling reasoning in large language models while maintaining stability and data efficiency.

Abstract

We propose to scale RL to unverifiable data with a novel algorithm JEPO (Jensen's Evidence lower bound Policy Optimization). While most prior efforts on scaling RL for LLMs focus on verifiable data where ground truth answers are typically short-form and can be matched easily, we investigate the case where such assumptions are less valid (e.g., when answers are long-form such as mathematical proofs). To scale RL training to unverifiable data with contemporary training constraints, we propose JEPO. JEPO applies Jensen's evidence lower bound, a pragmatic simplification of the evidence lower bound which views chain-of-thought as a latent variable in the generative process. We show that on verifiable data (math), JEPO is as effective as RL with verifiable rewards; on semi-verifiable data (numina), JEPO improves on soft-match based evaluations compared to RL with verifiable rewards, which can only leverage a subset of the data source; finally, on unverifiable data (numina-proof), JEPO outperforms SFT and a few ablation baselines on likelihood evaluations.

Paper Structure

This paper contains 61 sections, 5 theorems, 36 equations, 11 figures, 1 algorithm.

Key Result

Lemma 1

(Jensen's lower bound as a special case of ELBO) When $q_\phi(c|x,a^\ast)\coloneqq\pi_\theta(c|x)$, the ELBO is equivalent to Jensen's lower bound: $\mathcal{L}_{\theta,\phi}(x,a^\ast)=\mathcal{L}_{\theta}(x,a^\ast)$.
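
The reasoning behind Lemma 1 can be sketched as follows. This assumes the standard ELBO form with an inference distribution $q_\phi$ over chains-of-thought; the paper's exact definitions are not reproduced on this page, so the expressions below are an illustration rather than the authors' statement.

$$
\mathcal{L}_{\theta,\phi}(x,a^\ast) = \mathbb{E}_{c\sim q_\phi(\cdot|x,a^\ast)}\left[\log \pi_\theta(a^\ast|x,c)\right] - \mathrm{KL}\left(q_\phi(\cdot|x,a^\ast)\,\|\,\pi_\theta(\cdot|x)\right)
$$

Choosing $q_\phi(c|x,a^\ast)\coloneqq\pi_\theta(c|x)$ makes the KL term vanish, which leaves the expected log-likelihood under the chain-of-thought policy. Jensen's inequality then confirms that this quantity lower-bounds the marginal log-likelihood of the ground truth answer:

$$
\mathcal{L}_{\theta}(x,a^\ast) = \mathbb{E}_{c\sim \pi_\theta(\cdot|x)}\left[\log \pi_\theta(a^\ast|x,c)\right] \le \log \mathbb{E}_{c\sim \pi_\theta(\cdot|x)}\left[\pi_\theta(a^\ast|x,c)\right] = \log \pi_\theta(a^\ast|x)
$$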

Figures (11)

  • Figure 1: A canonical RL algorithm updates both its chain-of-thought policy $\pi_\theta(c|x)$ and the final conclusion $\pi_\theta(a|x,c)$ with advantage function computed from reward $r_i$ and an optional baseline $v_i$. JEPO has similar counterparts: updating the chain-of-thought policy using likelihood scores as the effective reward, and updating the answer policy using a supervised loss. Unlike RL baselines, JEPO does not require access to a reward $r_i$ but only access to a ground truth answer $a^*$. Due to the implementation-level similarity between JEPO and RL, it is straightforward to incorporate JEPO into existing stacks of large-scale RL training. We use the same baseline notation for the RL and JEPO loss, though they differ in practice. In general $v_i$ can be a leave-one-out control variate that is computed from other $n-1$ samples in the batch.
  • Figure 2: Graphical models for various algorithmic formulations discussed in this work. Solid lines represent generative models and dashed lines represent inference models. Circles represent random variables and squares represent parameters. Shading indicates that the random variable is observed, and is used for providing feedback for the learning process. For CoT optimization, $a^*$ is a simplified notation for the binary optimality variable $\mathds{1}_{\{a=a^\ast\}}$ from the random variable $a$. See the appendix for a more detailed explanation.
  • Figure 3: Verifiable data experiments with MATH. We compare three methods: online RL with access to the oracle Sympy-based reward, and JEPO with the multi-sample and single-sample lower bounds. In the left plot, we monitor the reward on the training dataset. Online RL obtains the best training-time trade-off, followed by the multi-sample lower bound and the single-sample lower bound. In the middle plot, we monitor the evaluation on a test set during training. The multi-sample lower bound and online RL obtain similar performance. In the right plot, we plot training reward against the lower bound objectives, averaged over training tokens. The two signals are positively correlated overall, and the multi-sample lower bound yields the stronger correlation.
  • Figure 4: Ablation of number of samples $n$ for multi-sample lower bounds. As we increase the number of samples, the multi-sample lower bound seems to further improve the training-time efficiency. This corroborates the theoretical insight that as $n$ increases, the multi-sample lower bound objectives become tighter.
  • Figure 5: Ablation of regularization coefficient $\beta$. As $\beta$ increases, all algorithmic variants seem to obtain better efficiency in the training performance-KL divergence trade-off. However, strong regularization also prevents the policy from deviating much from the reference policy, preventing bigger training improvements.
  • ...and 6 more figures
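
The captions for Figures 1 and 4 mention likelihood scores used as an effective reward, a leave-one-out control variate $v_i$ computed from the other $n-1$ samples, and a multi-sample bound that tightens as $n$ grows. A minimal NumPy sketch of these quantities follows. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the multi-sample bound is taken to be the log of the averaged likelihoods, and the baseline is taken to be the same bound recomputed without sample $i$. The array `loglik` stands in for per-sample values of $\log \pi_\theta(a^\ast|x,c_i)$ that a real model would produce.

```python
import numpy as np


def single_sample_bound(loglik):
    """Average of log-likelihoods: estimate of E_c[log pi(a*|x,c)]."""
    return float(np.mean(np.asarray(loglik, dtype=float)))


def multi_sample_bound(loglik):
    """log of the averaged likelihoods, log((1/n) * sum_i exp(loglik_i)).

    Computed with the max-shift trick for numerical stability.
    """
    loglik = np.asarray(loglik, dtype=float)
    m = loglik.max()
    return float(m + np.log(np.mean(np.exp(loglik - m))))


def leave_one_out_advantages(loglik):
    """Assumed form of the control variate: for each sample i, subtract the
    multi-sample bound computed from the other n-1 samples."""
    loglik = np.asarray(loglik, dtype=float)
    n = loglik.shape[0]
    full = multi_sample_bound(loglik)
    baselines = np.array(
        [multi_sample_bound(np.delete(loglik, i)) for i in range(n)]
    )
    return full - baselines


# Hypothetical log-likelihoods of the ground truth answer under n = 4
# sampled chains-of-thought.
loglik = np.array([-2.0, -0.5, -4.0, -1.0])

single = single_sample_bound(loglik)
multi = multi_sample_bound(loglik)
advantages = leave_one_out_advantages(loglik)

print("single-sample bound:", single)
print("multi-sample bound: ", multi)
print("advantages:         ", advantages)
```

By Jensen's inequality the single-sample value never exceeds the multi-sample value, and the chain-of-thought that makes the ground truth answer most likely receives the largest advantage. In a training loop, advantages of this kind would weight the log-probability gradient of each sampled chain-of-thought, while the answer tokens are trained with a supervised loss, as the Figure 1 caption describes.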

Theorems & Definitions (11)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Definition 3: A variance-reduced RL policy gradient estimate
  • Lemma 4
  • Theorem 5
  • Lemma 6
  • proof
  • proof: Proof of Theorem 5
  • ...and 1 more