Table of Contents
Fetching ...

Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

Andre He, Daniel Fried, Sean Welleck

TL;DR

The paper identifies a rank bias in GRPO that reinforces already probable solutions and under-explores rare but correct ones, limiting multi-sample proof performance. It introduces Unlikeliness Reward to up-weight low-probability correct trajectories and finds that increasing PPO epochs per batch also helps, albeit with cost. Together, these yield a revised GRPO training recipe for formal theorem proving, achieving competitive results with DeepSeek-Prover-V1.5-RL on miniF2F and releasing an open implementation. This work advances multi-sample exploration in verifier-based RL for formal mathematics by explicitly encouraging rare correct outputs and clarifying optimization dynamics.

Abstract

Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves competitive performance to DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release

Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

TL;DR

The paper identifies a rank bias in GRPO that reinforces already probable solutions and under-explores rare but correct ones, limiting multi-sample proof performance. It introduces Unlikeliness Reward to up-weight low-probability correct trajectories and finds that increasing PPO epochs per batch also helps, albeit with cost. Together, these yield a revised GRPO training recipe for formal theorem proving, achieving competitive results with DeepSeek-Prover-V1.5-RL on miniF2F and releasing an open implementation. This work advances multi-sample exploration in verifier-based RL for formal mathematics by explicitly encouraging rare correct outputs and clarifying optimization dynamics.

Abstract

Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@ across a large range of in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves competitive performance to DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release

Paper Structure

This paper contains 29 sections, 14 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: We identify a rank bias in GRPO in which model updates only reinforce already probable solutions and fail to surface new ones. This sharpens the distribution and impairs pass@$N$ performance for large N. Our unlikeliness reward addresses rank bias by explicitly encouraging uplifting low-probability correct solutions.
  • Figure 2: Finetuning DeepSeek-Prover-V1.5-SFT with GRPO, evaluated on $\mathcal{D}_{\text{val}}$. GRPO improves pass@$N$ significantly for small $N$, but performs worse than the base model for large $N$. We aim to understand this behavior and develop methods to overcome it.
  • Figure 3: Improvement in expected pass@$N$ assuming RL increases correct solution probabilities by a factor of $1 + \epsilon$ with $\epsilon=0.2$. Each curve corresponds to an initial $p_0 \in {1/2, 1/8, 1/32, 1/128, 1/512}$.
  • Figure 4: Uplift rate $u_j$ as a function of rank $j$ among positive samples. GRPO rarely increases the probability of lowest-ranked (i.e. rarest) correct samples.
  • Figure 5: Performance of GRPO variants on $\mathcal{D}_\text{val}$. Both the unlikeliness reward and additional PPO epochs improve pass@N. Appendix \ref{['appendix:eval']} details how we compute these metrics.
  • ...and 4 more figures