Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Zhengzhao Ma; Xueru Wen; Boxi Cao; Yaojie Lu; Hongyu Lin; Jinglin Yang; Min He; Xianpei Han; Le Sun

TL;DR

This work proposes DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Experiments show that DCPO preserves accuracy on par with GRPO while achieving the best calibration performance and substantially mitigating over-confidence.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies have focused on directly incorporating a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there is a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.

Paper Structure (38 sections, 10 theorems, 62 equations, 8 figures, 3 tables)

Introduction
Preliminaries and Related Work
Group Relative Policy Optimization (GRPO)
Confidence Estimation
Calibration Estimation
Calibration Optimization Methods
Empirical Analysis for Calibration Degeneration
Over-Confidence of Current LLMs
RLVR Leads to Calibration Degradation
Accuracy-Calibration Tradeoff of Coupled Optimization
Theoretical Analysis
Trajectory-Level Reinforcement Learning Induces Over-Confidence
Accuracy-Calibration Gradient Conflict
Group-Level Accuracy as Low-Variance Supervision
Decoupled Calibration Policy Optimization
...and 23 more sections

Key Result

Proposition 4.1

In the absence of explicit entropy regularization, any optimal solution to $\max_\theta J_{\mathrm{acc}}(\theta)$ assigns probability mass $1$ to a single trajectory $y^\ast\in\mathcal{Y}^+$.

Figures (8)

Figure 1: Illustration of gradient conflict between policy accuracy maximization and calibration error minimization.
Figure 2: The overall framework of DCPO, which leverages block-wise verbalized confidence rollout and decoupled advantage estimation to decouple the optimization objectives of accuracy and calibration, and further integrates instance-level and group-level signals for more stable calibration optimization.
Figure 3: Reliability diagrams for different LLMs. The dashed line denotes perfect calibration; bar height indicates empirical accuracy per confidence bin, and color intensity reflects sample frequency. The Expected Calibration Error (ECE) is reported above each subplot, revealing prevalent over-confidence across models.
Figure 4: The impact of RLVR on LLM calibration. Model confidence increases during RLVR training, and RLVR exacerbates the models' over-confidence.
Figure 5: The accuracy and calibration performance of QWEN3-8B trained with different RL methods. Existing calibration optimization methods improve model calibration, but at the cost of reduced accuracy.
...and 3 more figures
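
The Figure 3 caption reports an Expected Calibration Error (ECE) for each reliability diagram. As a rough illustration of how such a number is commonly computed, the sketch below uses equal-width confidence bins and weights each bin's gap between accuracy and mean confidence by the share of samples in that bin. The function name, the choice of 10 bins, and the binning scheme are assumptions made for this example; the page does not state the paper's exact settings.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (hypothetical helper, not the paper's code).

    confidences: predicted confidence per sample, in [0, 1]
    correct:     1 if the answer was right, 0 otherwise
    n_bins:      number of equal-width bins (assumed value)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; sample x falls in bin i when edges[i] < x <= edges[i + 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        accuracy = correct[mask].mean()
        mean_conf = confidences[mask].mean()
        # Weight the |accuracy - confidence| gap by the fraction of samples in the bin.
        ece += mask.mean() * abs(accuracy - mean_conf)
    return float(ece)


if __name__ == "__main__":
    # An over-confident toy model: always 90% confident, right 3 times out of 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In this toy case every sample lands in one bin with accuracy 0.75 and mean confidence 0.9, so the ECE is the single gap of about 0.15. A bar below the dashed diagonal in a reliability diagram corresponds to this kind of positive confidence-minus-accuracy gap.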

Theorems & Definitions (20)

Proposition 4.1: Mode Collapse
proof
Proposition 4.2: Gradient Conflict
proof
Proposition 4.3
proof
Proposition 4.4
proof
Theorem 5.1: Statistical Optimality of Decoupled Calibration
proof
...and 10 more
