Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Yuanda Xu; Hejian Sang; Zhengze Zhou; Ran He; Zhipeng Wang

TL;DR

This paper proposes the Asymmetric Confidence-aware Error Penalty (ACE), which introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. ACE's gradient decomposes into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches -- whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes -- treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. Theoretically, we demonstrate that ACE's gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
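As a minimal sketch of the idea described in the abstract, the snippet below computes the confidence shift c_i from sequence-level log-probabilities and uses it to scale the negative advantages of incorrect rollouts. The Softplus weighting is taken from the Figure 1 caption, but the exact combination rule (here `1 + alpha * softplus(c_i)`), the `alpha` parameter, and all function names are assumptions made for illustration and may differ from the paper's Definition 2.

```python
import math

def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def confidence_shift(logp_theta: float, logp_ref: float) -> float:
    # c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), computed from sequence log-probs.
    return logp_theta - logp_ref

def ace_advantages(advantages, logp_theta, logp_ref, alpha=1.0):
    # Hypothetical modulation rule (an assumption, not the paper's exact formula):
    # positive advantages pass through unchanged; negative advantages are
    # scaled by 1 + alpha * softplus(c_i), so overconfident errors (c_i > 0)
    # are penalized more strongly than errors the policy already down-weights.
    out = []
    for a, lt, lr in zip(advantages, logp_theta, logp_ref):
        if a < 0:
            c = confidence_shift(lt, lr)
            a = a * (1.0 + alpha * softplus(c))
        out.append(a)
    return out

# One correct rollout, one overconfident error (c = 4), one self-correcting error (c = -6).
adv = ace_advantages([1.0, -1.0, -1.0], [-5.0, -2.0, -9.0], [-5.0, -6.0, -3.0])
```

Under this assumed rule, the error whose probability dropped relative to the reference policy keeps a penalty close to the uniform GRPO value, while the overconfident error receives a several-fold larger one.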

Paper Structure (71 sections, 3 theorems, 50 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
Our contribution.
Related Work
Curriculum and advantage shaping.
KL regularization in RLHF/RLVR.
Entropy regularization and clipping strategies.
Reward shaping.
Diversity loss in RLVR.
Preliminaries
Setting.
GRPO objective.
Observation: uniform penalty within groups.
Motivation: overconfident errors.
The ACE Method
Error Confidence Score
...and 56 more sections

Key Result

Theorem 1

Let $\mathcal{L}_{\mathrm{std}}(\theta)$ denote the standard policy gradient objective (the GRPO objective) with uniform negative advantages $\hat{A}^-$, and let $\mathcal{L}_{\mathrm{ACE}}(\theta)$ denote the objective with ACE advantages (Definition 2). Define the selective regularizer: where $|\hat{A}^-(x)|$ is the magnitude of the standard GRPO negative advantage for prompt $x$. Assume rollo

Figures (4)

Figure 1: ACE Method Overview. Top: Incorrect rollouts fall into three regimes based on the confidence shift $c_i = \log(\pi_\theta(y_i|x)/\pi_{\mathrm{ref}}(y_i|x))$. Bottom-left: Standard GRPO assigns a uniform penalty $|\hat{A}^-|$ to all errors regardless of regime. Bottom-right: ACE modulates the penalty via $\text{Softplus}(c_i)$, strongly penalizing overconfident errors while leaving self-correcting errors nearly untouched.
Figure 2: Performance Comparison across Benchmarks. Pass@$k$ curves for all five methods on MATH-500 (left column) and AIME 2025 (right column) across three model families: Qwen2.5-Math-7B (top row), Qwen3-8B-Base (middle row), and Llama-3.1-8B-Instruct (bottom row). ACE-GRPO and ACE-DAPO consistently outperform their respective baselines (GRPO and DAPO) across all sampling budgets, model families, and benchmarks, with larger gains at higher $k$ values. ACE-DAPO achieves the best overall performance, confirming that ACE's rollout-level correction composes with DAPO's token-level diversity preservation and generalizes across model families.
Figure 3: Overconfident Error Dynamics. Left: Overconfident error fraction (OEF) over training. Right: Mean overconfidence magnitude for $c_i > 0$ errors. ACE-GRPO effectively suppresses both metrics compared to standard GRPO.
Figure 4: Entropy Dynamics. Token-level entropy over the first 20 training steps. Left: On Qwen2.5-Math-7B, ACE-GRPO retains substantially more entropy than standard GRPO, which suffers rapid entropy collapse. Right: On Qwen3-8B-Base, ACE-GRPO maintains more stable entropy, demonstrating consistency across architectures. We report entropy dynamics for the two Qwen models only; Llama-3.1-8B-Instruct is excluded because its lower baseline accuracy makes the entropy signal less directly comparable (see the experiments section for discussion).
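The two diagnostics named in the Figure 3 caption can be sketched as follows. This is an illustrative reading rather than the paper's evaluation code: it assumes OEF is the share of incorrect rollouts with $c_i > 0$ and that the overconfidence magnitude is the mean of $c_i$ over those rollouts. The function name `overconfidence_stats` and its inputs are hypothetical.

```python
def overconfidence_stats(shifts, correct):
    # shifts: confidence shifts c_i, one per rollout.
    # correct: booleans marking whether each rollout was verified correct.
    errors = [c for c, ok in zip(shifts, correct) if not ok]
    over = [c for c in errors if c > 0]
    # Assumed definition: fraction of incorrect rollouts that are overconfident.
    oef = len(over) / len(errors) if errors else 0.0
    # Assumed definition: mean c_i among overconfident errors.
    mean_magnitude = sum(over) / len(over) if over else 0.0
    return oef, mean_magnitude

# Three incorrect rollouts (two with c_i > 0) and one correct rollout.
oef, mean_magnitude = overconfidence_stats(
    [2.0, -1.0, 0.5, 3.0], [False, False, False, True]
)
```

Tracking both values per training step would show whether a method reduces how often errors are overconfident, how strongly, or both.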

Theorems & Definitions (12)

Definition 1: Error Confidence Score
Remark 1: Three regimes
Definition 2: ACE Advantage
Theorem 1: Selective Regularization Decomposition
proof
Remark 2: Residual term and contrast with global KL
Remark 3: Why stop-gradient is preferable to the full regularizer
Proposition 1: Second Moment Increase
proof
Definition 3: Directional Signal and Variance
...and 2 more
