Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes

Michael Bereket, Jure Leskovec

TL;DR

The paper investigates whether reinforcement-learning-based reasoning can extend from deterministic domains to verifiable domains with stochastic outcomes. It compares GRPO, PPO, and RLOO on synthetic data and a real CRISPR Perturb-seq task, showing that GRPO with group standard normalization induces overconfident probability predictions, while PPO and RLOO yield well-calibrated predictions. Removing standard normalization from GRPO fixes the miscalibration, a result supported by a theoretical analysis of bias in GRPO advantage estimates. The authors argue that standard normalization introduces a policy-dependent bias that amplifies overconfident predictions, and they advocate unbiasedness as a design principle for reasoning RL so that models can reason robustly under uncertainty in scientific contexts.

Abstract

Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments. Through applications to synthetic data and real-world biological experiments, we demonstrate that Group Relative Policy Optimization (GRPO) induces overconfident probability predictions for binary stochastic outcomes, while Proximal Policy Optimization (PPO) and REINFORCE Leave-One-Out (RLOO) yield well-calibrated models. We show that removing group standard normalization in GRPO fixes its miscalibration and provide a theoretical explanation for why normalization causes overconfidence. Our results provide new evidence against the use of standard normalization in GRPO and help pave the way for applications of RL for reasoning language models beyond deterministic domains.
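To make the comparison concrete, the following is a minimal sketch of the advantage estimators named in the abstract, applied to one group of probability predictions scored with a log-likelihood reward. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `eps` term, the use of the population standard deviation, and the example group are all hypothetical choices. PPO is omitted because it relies on a learned value function rather than a group statistic.

```python
import math
from statistics import mean, pstdev


def log_likelihood_reward(p, outcome):
    """Log-likelihood of a binary outcome (0 or 1) under predicted probability p."""
    return math.log(p) if outcome == 1 else math.log(1.0 - p)


def grpo_advantages(rewards, normalize_std=True, eps=1e-8):
    """Group-relative advantages: subtract the group mean and,
    optionally, divide by the group standard deviation.
    Assumption: population std plus a small eps for stability."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


def rloo_advantages(rewards):
    """Leave-one-out baseline: compare each reward with the mean of the others."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical group of predictions for one prompt whose observed outcome is 1.
preds = [0.5, 0.7, 0.9, 0.99]
rewards = [log_likelihood_reward(p, outcome=1) for p in preds]

adv_grpo = grpo_advantages(rewards, normalize_std=True)
adv_no_std = grpo_advantages(rewards, normalize_std=False)
adv_rloo = rloo_advantages(rewards)

for p, a, b, c in zip(preds, adv_grpo, adv_no_std, adv_rloo):
    print(f"p={p:.2f}  GRPO={a:+.3f}  GRPO(no std)={b:+.3f}  RLOO={c:+.3f}")
```

In this sketch, the mean-centered and leave-one-out estimators differ only by the constant factor n/(n-1), whereas the standard-normalized variant rescales each group by a quantity that depends on the spread of the policy's own predictions.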

Paper Structure

This paper contains 17 sections, 8 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Group standard normalization in GRPO induces overconfident predictions of stochastic outcome probabilities. Top: Probability prediction task. Bottom: Synthetic data experiment results. Models trained with PPO, RLOO, and GRPO with no standard normalization are well calibrated, while models trained with GRPO are extremely overconfident.
  • Figure 2: Real-world biological experiment prediction results. We optimize Qwen3-4B to predict the probability of binary experimental outcomes (will perturbing a target gene have a strong effect on a specified phenotype in cells?) with rewards derived from real-world experiments. We find that models optimized with PPO, RLOO, and GRPO with no standard normalization achieve well-calibrated predictions for held-out test perturbations, while GRPO predicts highly overconfident probabilities. Error bars represent 95% confidence intervals.
  • Figure 3: Bias in GRPO advantage estimates explains overconfident predictions. Advantages are computed with a log-likelihood reward. Left: Under a uniform policy, both GRPO and GRPO without standard normalization closely approximate the true advantages. Middle: Under a policy concentrated on the true probability, GRPO overestimates the advantage of overconfident predictions. Right: As the policy becomes increasingly overconfident, GRPO increasingly overestimates the advantage of more overconfident predictions. This pattern creates a positive feedback loop towards increasingly overconfident predictions consistent with our experimental observations.
  • Figure 4: Analysis of advantage estimates with a reward based on the Brier score. We observe a similar pattern of overestimated advantages for overconfident probabilities as observed with a log-likelihood reward in Fig. 3.
  • Figure 5: Empirical estimates of $\sigma_0$ and $\sigma_1$ (standard deviation of rewards within groups for answers 0 and 1) for the three policies in Figures 3 and 4. As the policies concentrate on predictions greater than 0.5, $\sigma_0$ becomes larger than $\sigma_1$.
  • ...and 3 more figures
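
The pattern described in the captions for Figures 3 and 5 can be reproduced in a small toy calculation. The sketch below assumes a fixed, hypothetical group of predictions concentrated above 0.5 and a hypothetical true outcome probability of 0.7; it computes the within-group reward standard deviations for each outcome and the expected advantage of each prediction with and without standard normalization. It is an illustration of the mechanism, not the paper's analysis, and the helper names are invented for this example.

```python
import math
from statistics import mean, pstdev


def reward(p, outcome):
    """Log-likelihood reward for predicted probability p and a binary outcome."""
    return math.log(p) if outcome == 1 else math.log(1.0 - p)


def expected_advantages(preds, true_prob, normalize_std):
    """Expected group-relative advantage of each prediction, averaged over
    outcome ~ Bernoulli(true_prob). The group statistics are computed
    separately for each outcome, since one group shares one observed outcome."""
    expected = [0.0] * len(preds)
    for outcome, weight in ((1, true_prob), (0, 1.0 - true_prob)):
        rewards = [reward(p, outcome) for p in preds]
        mu = mean(rewards)
        scale = pstdev(rewards) if normalize_std else 1.0
        for i, r in enumerate(rewards):
            expected[i] += weight * (r - mu) / scale
    return expected


# Hypothetical group concentrated on predictions greater than 0.5.
preds = [0.6, 0.7, 0.8, 0.9]
true_prob = 0.7  # assumed true probability of outcome 1

sigma_1 = pstdev([reward(p, 1) for p in preds])  # reward spread when outcome is 1
sigma_0 = pstdev([reward(p, 0) for p in preds])  # reward spread when outcome is 0

adv_plain = expected_advantages(preds, true_prob, normalize_std=False)
adv_norm = expected_advantages(preds, true_prob, normalize_std=True)

best_plain = preds[adv_plain.index(max(adv_plain))]
best_norm = preds[adv_norm.index(max(adv_norm))]

print(f"sigma_0={sigma_0:.3f}  sigma_1={sigma_1:.3f}")
print(f"highest expected advantage without std normalization: p={best_plain}")
print(f"highest expected advantage with std normalization:    p={best_norm}")
```

In this toy group, $\sigma_0$ exceeds $\sigma_1$, so dividing by the group standard deviation scales up the rewards earned when the outcome is 1 relative to the penalties incurred when it is 0. Without normalization the highest expected advantage goes to the prediction equal to the assumed true probability (0.7); with normalization it goes to the most confident prediction in the group (0.9).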