Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

TL;DR

This work proposes DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives, and demonstrates that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates over-confidence.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers from severe calibration degeneration, in which models become excessively over-confident in incorrect answers. Previous studies directly incorporate a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.

Paper Structure (38 sections, 10 theorems, 62 equations, 8 figures, 3 tables)

Key Result

Proposition 4.1

In the absence of explicit entropy regularization, any optimal solution to $\max_\theta J_{\mathrm{acc}}(\theta)$ assigns probability mass $1$ to a single trajectory $y^\ast\in\mathcal{Y}^+$.

Figures (8)

  • Figure 1: Illustration of gradient conflict between policy accuracy maximization and calibration error minimization.
  • Figure 2: The overall framework of DCPO, which leverages block-wise verbalized confidence rollout and decoupled advantage estimation to decouple the optimization objectives of accuracy and calibration, and further integrates instance-level and group-level signals for more stable calibration optimization.
  • Figure 3: Reliability diagrams for different LLMs. The dashed line denotes perfect calibration; bar height indicates empirical accuracy per confidence bin, and color intensity reflects sample frequency. The Expected Calibration Error (ECE) is reported above each subplot, revealing prevalent over-confidence across models.
  • Figure 4: The impact of RLVR on LLM calibration, demonstrating that model confidence increases during RLVR training and that RLVR exacerbates the models' over-confidence.
  • Figure 5: The accuracy and calibration performance of QWEN3-8B trained with different RL methods. The figures illustrate that while existing calibration optimization methods can improve model calibration, they do so at the cost of accuracy.
  • ...and 3 more figures
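
The Expected Calibration Error (ECE) reported in the Figure 3 reliability diagrams can be illustrated with a minimal sketch. This assumes the common equal-width binning definition of ECE, in which each confidence bin contributes the gap between its empirical accuracy and its mean confidence, weighted by the fraction of samples it holds. The function name, the default of 10 bins, and the toy inputs are illustrative assumptions, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE (an assumed, standard definition).

    confidences: floats in [0, 1], one stated confidence per answer.
    correct:     booleans, True where the answer was verified correct.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")

    # Group samples into equal-width confidence bins; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(is_correct)))

    # Sum the per-bin |accuracy - mean confidence| gaps, weighted by bin size.
    ece = 0.0
    for samples in bins:
        if not samples:
            continue
        mean_conf = sum(conf for conf, _ in samples) / len(samples)
        accuracy = sum(1 for _, ok in samples if ok) / len(samples)
        ece += (len(samples) / n) * abs(accuracy - mean_conf)
    return ece


# Toy over-confident model: states 0.95 confidence but is right half the time.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
# Toy calibrated model: states 1.0 confidence and is always right.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

In the over-confident toy case all four samples fall in the top bin, so the ECE is simply the gap between mean confidence (0.95) and accuracy (0.5). This is the kind of gap the reliability diagrams show as bars falling below the dashed diagonal.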

Theorems & Definitions (20)

  • Proposition 4.1: Mode Collapse
  • proof
  • Proposition 4.2: Gradient Conflict
  • proof
  • Proposition 4.3
  • proof
  • Proposition 4.4
  • proof
  • Theorem 5.1: Statistical Optimality of Decoupled Calibration
  • proof
  • ...and 10 more