Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting

Yunzhen Feng, Parag Jain, Anthony Hartshorn, Yaqi Duan, Julia Kempe

TL;DR

LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, so that negative groups become informative and previously wasted samples yield useful gradient updates. It offers a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as Likelihood Estimation with Negative Samples (LENS). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
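
The zero-advantage problem described in the abstract can be illustrated with a minimal sketch. It assumes the common group-normalized form of the GRPO advantage (reward minus group mean, divided by group standard deviation). The function `confidence_penalized_rewards` is a hypothetical stand-in, not the LENS weight function from the paper: it penalizes an incorrect response by its sequence probability, only to show that any confidence-dependent reward makes a negative group informative. The log-probabilities are made-up values.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def confidence_penalized_rewards(correct, logprobs):
    """Hypothetical confidence-dependent reward (NOT the paper's formula).

    Correct responses get reward 1. Incorrect responses get -exp(logprob),
    so a more confident mistake receives a larger penalty.
    """
    return [1.0 if c else -math.exp(lp) for c, lp in zip(correct, logprobs)]


# A negative group: none of the sampled responses is correct.
correct = [False, False, False, False]
logprobs = [-0.5, -2.0, -4.0, -1.0]  # made-up sequence log-probabilities

# Binary verifiable rewards: every advantage is zero, so there is no gradient.
binary_rewards = [1.0 if c else 0.0 for c in correct]
adv_binary = grpo_advantages(binary_rewards)

# Confidence-dependent rewards: advantages now differ across the group.
adv_weighted = grpo_advantages(confidence_penalized_rewards(correct, logprobs))

print("binary  :", adv_binary)
print("weighted:", [round(a, 3) for a in adv_weighted])
```

With binary rewards the whole group collapses to zero advantage, while under the illustrative penalty the most confident mistake (the first response) receives the most negative advantage and the least confident one (the third response) the most positive.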

Paper Structure

This paper contains 30 sections, 2 theorems, 47 equations, 6 figures, 3 tables.

Key Result

Theorem 1

The gradient of the log-likelihood $\widehat{\mathcal{L}}(\theta)$ with respect to the parameters $\theta$ is given by

Figures (6)

  • Figure 1: Overview of our approach. Standard approaches like GRPO assign a uniform reward of $0$ to all incorrect answers. This provides no learning signal, causing these samples to be discarded. Our method, LENS, is derived from reward modeling via Maximum Likelihood Estimation (MLE) and assigns non-zero, confidence-dependent rewards to incorrect responses. This creates a clear learning signal where differences emerge from the samples, converting previously discarded information into useful gradient updates.
  • Figure 2: Negative group ratio during GRPO training of Llama-3.1-8B-Instruct with MATH and Numina 1.5. $G=16$.
  • Figure 3: An optimal policy $\uppi^{\star}$ is derived from reward probabilities $p^\star$ through normalization (see Equation (\ref{eq:policy2prob})). This approach reframes the task of finding the best policy as a more straightforward statistical problem: learning a reward model from data.
  • Figure 4: Illustration of the weight function $w(z)$.
  • Figure 5: Comparison of our algorithm and the GRPO baseline: performance on the full MATH test set and the Levels 4–5 (hard) subset. Top: Llama-3.1-8B-Instruct; bottom: Qwen-2.5-3B-Base. Accuracy is averaged across all 16 generations during evaluation and over two independent runs. Training set: MATH + DAPO. Our algorithm improves performance for both models.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2