Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang

TL;DR

The paper proposes the Asymmetric Confidence-aware Error Penalty (ACE), which uses a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. ACE's gradient decomposes into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches -- whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes -- treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. Theoretically, we demonstrate that ACE's gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
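
As a rough illustration of the confidence shift and the asymmetric penalty described above, the sketch below computes c_i from sequence log-probabilities and rescales only negative advantages with a Softplus-based weight. The function names, the `alpha` coefficient, and the specific form `A * (1 + alpha * Softplus(c_i))` are assumptions made for illustration; the text here states only that negative advantages are modulated by the confidence shift, with Figure 1 naming Softplus(c_i) as the modulator. In an autodiff framework, c_i would presumably be treated as a constant (stop-gradient), which this plain-Python sketch does not need to model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def confidence_shift(logp_theta: float, logp_ref: float) -> float:
    """c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), from sequence log-probs."""
    return logp_theta - logp_ref


def ace_advantage(advantage: float, c: float, alpha: float = 1.0) -> float:
    """Hypothetical ACE-style advantage (assumed form, not the paper's exact one).

    Non-negative advantages pass through unchanged. Negative advantages are
    scaled by (1 + alpha * softplus(c)), so errors with c > 0 (overconfident)
    are penalized more, while errors with c << 0 keep roughly the standard
    penalty.
    """
    if advantage >= 0.0:
        return advantage
    return advantage * (1.0 + alpha * softplus(c))


if __name__ == "__main__":
    # Three incorrect rollouts sharing one uniform negative advantage of -1.0,
    # with illustrative (made-up) sequence log-probs under policy and reference.
    for logp_theta, logp_ref in [(-8.0, -12.0), (-10.0, -10.0), (-14.0, -10.0)]:
        c = confidence_shift(logp_theta, logp_ref)
        print(f"c_i = {c:+.1f}  ->  advantage = {ace_advantage(-1.0, c):.3f}")
```

Under this assumed form, the penalty grows roughly linearly in c_i for strongly overconfident errors and approaches the uniform GRPO penalty as c_i becomes very negative.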

Paper Structure (71 sections, 3 theorems, 50 equations, 4 figures, 5 tables, 1 algorithm)

Key Result

Theorem 1

Let $\mathcal{L}_{\mathrm{std}}(\theta)$ denote the standard policy gradient objective (the GRPO objective equation) with uniform negative advantages $\hat{A}^-$, and let $\mathcal{L}_{\mathrm{ACE}}(\theta)$ denote the objective with ACE advantages (the ACE advantage equation). Define the selective regularizer: where $|\hat{A}^-(x)|$ is the magnitude of the standard GRPO negative advantage for prompt $x$. Assume rollo

Figures (4)

  • Figure 1: ACE Method Overview. Top: Incorrect rollouts fall into three regimes based on the confidence shift $c_i = \log(\pi_\theta(y_i|x)/\pi_{\mathrm{ref}}(y_i|x))$. Bottom-left: Standard GRPO assigns a uniform penalty $|\hat{A}^-|$ to all errors regardless of regime. Bottom-right: ACE modulates the penalty via $\text{Softplus}(c_i)$, strongly penalizing overconfident errors while leaving self-correcting errors nearly untouched.
  • Figure 2: Performance Comparison across Benchmarks. Pass@$k$ curves for all five methods on MATH-500 (left column) and AIME 2025 (right column) across three model families: Qwen2.5-Math-7B (top row), Qwen3-8B-Base (middle row), and Llama-3.1-8B-Instruct (bottom row). ACE-GRPO and ACE-DAPO consistently outperform their respective baselines (GRPO and DAPO) across all sampling budgets, model families, and benchmarks, with larger gains at higher $k$ values. ACE-DAPO achieves the best overall performance, confirming that ACE's rollout-level correction composes with DAPO's token-level diversity preservation and generalizes across model families.
  • Figure 3: Overconfident Error Dynamics. Left: Overconfident error fraction (OEF) over training. Right: Mean overconfidence magnitude for $c_i > 0$ errors. ACE-GRPO effectively suppresses both metrics compared to standard GRPO.
  • Figure 4: Entropy Dynamics. Token-level entropy over the first 20 training steps. Left: On Qwen2.5-Math-7B, ACE-GRPO retains substantially more entropy than standard GRPO, which suffers rapid entropy collapse. Right: On Qwen3-8B-Base, ACE-GRPO maintains more stable entropy, demonstrating consistency across architectures. We report entropy dynamics for the two Qwen models only; Llama-3.1-8B-Instruct is excluded because its lower baseline accuracy makes the entropy signal less directly comparable (see the experiments section for discussion).
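
The quantities plotted in Figures 2 and 3 can be sketched in a few lines. The code below is illustrative only: `pass_at_k` uses the commonly used unbiased combinatorial estimator, which may differ from how the paper computes its curves, and `overconfidence_stats` assumes that the overconfident error fraction (OEF) is the share of incorrect rollouts with $c_i > 0$ and that the overconfidence magnitude is the mean $c_i$ over those rollouts. Both function names and both definitions are assumptions, not taken from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields 1.0 as expected.
    return 1.0 - comb(n - c, k) / comb(n, k)


def overconfidence_stats(
    shifts: Sequence[float], is_correct: Sequence[bool]
) -> Tuple[float, float]:
    """Return (OEF, mean overconfidence magnitude) under the assumed definitions.

    shifts[i] is the confidence shift c_i of rollout i; is_correct[i] is its
    verifier outcome. Both values default to 0.0 when undefined.
    """
    errors = [c for c, ok in zip(shifts, is_correct) if not ok]
    overconfident = [c for c in errors if c > 0.0]
    oef = len(overconfident) / len(errors) if errors else 0.0
    mean_mag = sum(overconfident) / len(overconfident) if overconfident else 0.0
    return oef, mean_mag
```

For example, with four rollouts whose shifts are `[1.0, -0.5, 2.0, 0.3]` and only the last one correct, two of the three errors have $c_i > 0$, so the assumed OEF is 2/3 and the mean magnitude is 1.5.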

Theorems & Definitions (12)

  • Definition 1: Error Confidence Score
  • Remark 1: Three regimes
  • Definition 2: ACE Advantage
  • Theorem 1: Selective Regularization Decomposition
  • proof
  • Remark 2: Residual term and contrast with global KL
  • Remark 3: Why stop-gradient is preferable to the full regularizer
  • Proposition 1: Second Moment Increase
  • proof
  • Definition 3: Directional Signal and Variance
  • ...and 2 more