Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening
Andre He, Daniel Fried, Sean Welleck
TL;DR
The paper identifies a rank bias in GRPO that reinforces already probable solutions and under-explores rare but correct ones, which limits multi-sample proof performance. It introduces an unlikeliness reward that up-weights low-probability correct trajectories, and finds that increasing the number of PPO epochs per batch also helps, albeit at added computational cost. Together, these yield a revised GRPO training recipe for formal theorem proving that achieves results competitive with DeepSeek-Prover-V1.5-RL on miniF2F, with an open implementation released. The work advances multi-sample exploration in verifier-based RL for formal mathematics by explicitly encouraging rare correct outputs and clarifying the optimization dynamics behind rank bias.
Abstract
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a wide range of $N$ in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves performance competitive with DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release
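
To make the idea of up-weighting rare but correct solutions concrete, the sketch below shows one way a rank-based reward could be combined with GRPO-style group-normalized advantages. The abstract does not give the formula, so everything here is an assumption for illustration: the function names, the linear rank penalty, the parameter `beta`, and the use of the population standard deviation are hypothetical and may differ from the released implementation.

```python
from statistics import mean, pstdev


def unlikeliness_rewards(correct, logprobs, beta=0.25):
    """Scale each correct sample's reward down the more probable it is.

    correct:  verifier outcomes (bool), one per sampled proof in the group
    logprobs: sequence log-probabilities under the current policy
    beta:     hypothetical penalty strength in [0, 1] (an assumption)
    """
    G = len(correct)
    # rank 0 = most probable sample in the group, rank G-1 = least probable
    order = sorted(range(G), key=lambda i: logprobs[i], reverse=True)
    rank = {idx: r for r, idx in enumerate(order)}
    # Incorrect samples get 0; correct ones get 1 minus a rank-based penalty,
    # so the most probable correct sample is penalized the most.
    return [
        (1.0 - beta * (G - rank[i]) / G) if correct[i] else 0.0
        for i in range(G)
    ]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within the sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four sampled proofs for one theorem (made-up numbers).
correct = [True, True, False, True]
logprobs = [-2.0, -9.0, -4.0, -5.0]

rewards = unlikeliness_rewards(correct, logprobs)
print(rewards)  # [0.75, 0.9375, 0.0, 0.875]
print(group_advantages(rewards))
```

In this toy group the least probable correct proof (index 1) receives the largest reward and therefore the largest advantage, whereas with a plain 0/1 reward all three correct proofs would receive the same advantage.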
