Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data
Yunhao Tang, Sid Wang, Lovish Madaan, Rémi Munos
TL;DR
This work scales reinforcement learning for language models to unverifiable data by introducing JEPO (Jensen's Evidence lower bound Policy Optimization), a policy optimization method that treats chain-of-thought as a latent variable. By using a multi-sample Jensen lower bound and combining a variance-reduced RL-like update with a supervised loss, JEPO trains without externally verifiable rewards, which brings long-form reasoning tasks such as proofs within reach. The authors derive the connections to the ELBO and to RL, describe implementation strategies, and report competitive or superior performance across verifiable (short-form math), semi-verifiable (Numina with mixed rewards), and unverifiable (Numina-proof) data. The approach extends RL-style training to abundant long-form data, with practical implications for scaling reasoning in large language models while maintaining stability and data efficiency.
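For orientation, the bound referred to above can be sketched as follows. The notation is an assumption made here for illustration, not taken from the summary: $x$ is the prompt, $c$ the chain-of-thought, $a^\star$ the ground-truth answer, and $\pi_\theta$ the language model. The first line is Jensen's inequality applied to the marginal likelihood of the answer; the second is one standard multi-sample form with $n$ i.i.d. chains-of-thought, and the paper's exact objective may differ in its details.

```latex
\begin{align}
\log \pi_\theta(a^\star \mid x)
  &= \log \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}
       \big[ \pi_\theta(a^\star \mid x, c) \big]
   \;\ge\; \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}
       \big[ \log \pi_\theta(a^\star \mid x, c) \big], \\
\log \pi_\theta(a^\star \mid x)
  &\ge\; \mathbb{E}_{c_1, \dots, c_n \sim \pi_\theta(\cdot \mid x)}
       \Big[ \log \frac{1}{n} \sum_{i=1}^{n} \pi_\theta(a^\star \mid x, c_i) \Big].
\end{align}
```

Both right-hand sides depend only on the model's own likelihood of the reference answer, so no external verifier is needed to evaluate them.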
Abstract
We propose to scale RL to unverifiable data with a novel algorithm, JEPO (Jensen's Evidence lower bound Policy Optimization). While most prior efforts on scaling RL for LLMs focus on verifiable data, where ground truth answers are typically short-form and can be matched easily, we investigate the case where such assumptions are less valid (e.g., when answers are long-form, such as mathematical proofs). To scale RL training to unverifiable data under contemporary training constraints, JEPO applies Jensen's evidence lower bound, a pragmatic simplification of the evidence lower bound that views chain-of-thought as a latent variable in the generative process. We show that on verifiable data (math), JEPO is as effective as RL with verifiable rewards; on semi-verifiable data (numina), JEPO improves on soft-match based evaluations compared to RL with verifiable rewards, which can only leverage a subset of the data source; finally, on unverifiable data (numina-proof), JEPO outperforms SFT and a few ablation baselines on likelihood evaluations.
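To make the latent-variable view concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the three-valued latent "chain-of-thought", the `prior` and `likelihood` tables, and the helper `multi_sample_jensen_bound` are all hypothetical, chosen only so the bound can be computed exactly by enumeration. It compares the log marginal likelihood of an answer with the Jensen lower bound averaged over `n` i.i.d. latent samples.

```python
import itertools
import math


def multi_sample_jensen_bound(prior, likelihood, n):
    """Exact value of E[log((1/n) * sum_i likelihood[c_i])],
    where c_1..c_n are drawn i.i.d. from `prior` (hypothetical toy helper)."""
    total = 0.0
    # Enumerate every n-tuple of latent values instead of sampling.
    for cs in itertools.product(range(len(prior)), repeat=n):
        weight = math.prod(prior[c] for c in cs)        # probability of this tuple
        avg_lik = sum(likelihood[c] for c in cs) / n    # averaged answer likelihood
        total += weight * math.log(avg_lik)
    return total


# Toy assumption: three possible chains-of-thought with these probabilities,
# and the probability of the reference answer given each one.
prior = [0.5, 0.3, 0.2]
likelihood = [0.9, 0.1, 0.4]

# Log marginal likelihood of the answer (the quantity being lower-bounded).
log_marginal = math.log(sum(p * l for p, l in zip(prior, likelihood)))

bound_1 = multi_sample_jensen_bound(prior, likelihood, 1)  # single-sample bound
bound_4 = multi_sample_jensen_bound(prior, likelihood, 4)  # multi-sample bound

print(f"log marginal: {log_marginal:.4f}")
print(f"n=1 bound:    {bound_1:.4f}")
print(f"n=4 bound:    {bound_4:.4f}")
```

In this toy setting the bound with more samples sits closer to the log marginal, which is the usual motivation for a multi-sample bound. In actual training the expectation would be estimated from sampled chains-of-thought and not enumerated.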
