C$^2$GSPG: Confidence-calibrated Group Sequence Policy Gradient towards Self-aware Reasoning

Haotian Liu, Shuo Wang, Hongteng Xu

TL;DR

The paper addresses overconfidence in RL-based reasoning for post-trained LLMs and introduces Confidence-calibrated Group Sequence Policy Gradient (C$^2$GSPG). The method defines a sequence-level confidence $c_{\theta,i}$ from normalized sequence probabilities and couples a binary cross-entropy (BCE) regularizer with a group sequence policy gradient. This removes token-level bias and aligns policy optimization with calibration; for binary rewards, the two objectives always share the same gradient direction. For non-binary rewards, nonlinear reward normalization and adaptive regularizer clipping reduce gradient conflicts. On logical and mathematical reasoning tasks, the method reports higher reasoning accuracy and better confidence calibration than state-of-the-art methods, and the code is open source. The approach advances self-aware reasoning by making model confidence reflect actual performance and by stabilizing training across diverse reasoning tasks and model scales.

Abstract

Reinforcement Learning (RL) methods, exemplified by Group Relative Policy Optimization (GRPO) and its variants, play a central role in developing reasoning models. However, these methods often suffer from a critical overconfidence issue, which prevents them from achieving self-aware reasoning models. In this study, we propose a simple yet effective confidence-calibrated group sequence policy gradient method, called C$^2$GSPG, which enhances reasoning performance while suppressing overconfidence. In principle, we propose a Group Sequence Policy Gradient (GSPG) framework for learning reasoning models, which eliminates the token-level bias commonly appearing in GRPO and its variants. In this framework, we define the model confidence for each reasoning problem using the normalized sequence-level probability, and then apply a cross-entropy regularizer to calibrate the model confidence to the sequence's reward. We demonstrate that the confidence calibration regularizer and GSPG are collaborative for binary rewards, as their objectives always share the same gradient direction. For non-binary rewards, we apply nonlinear reward normalization and adaptive regularizer clipping, mitigating the potential conflict between the two objectives. Applying C$^2$GSPG to post-train large language models in logical and mathematical reasoning tasks, we show its superiority over state-of-the-art methods in both reasoning accuracy and confidence calibration. The code of C$^2$GSPG is available at https://github.com/HaotianLiu123/CCGSPG.
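
The two ingredients named in the abstract, a sequence-level confidence and a cross-entropy calibration term, can be illustrated with a minimal Python sketch. It assumes that "normalized sequence-level probability" means the length-normalized (geometric-mean) token probability; the abstract does not spell out the exact normalization, so this is an assumption. The function names `sequence_confidence` and `bce_calibration_loss` and the token log-probabilities are hypothetical and not taken from the released code.

```python
import math


def sequence_confidence(token_logprobs):
    """Length-normalized sequence probability: exp(mean token log-prob).

    Assumption: this geometric-mean normalization stands in for the
    paper's "normalized sequence-level probability".
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def bce_calibration_loss(confidence, reward, eps=1e-8):
    """Binary cross-entropy between a confidence and a reward in [0, 1]."""
    c = min(max(confidence, eps), 1.0 - eps)  # keep log() finite
    return -(reward * math.log(c) + (1.0 - reward) * math.log(1.0 - c))


# Hypothetical per-token log-probabilities for two sampled responses.
confident_response = [math.log(0.9)] * 4
hesitant_response = [math.log(0.5)] * 4

c_high = sequence_confidence(confident_response)
c_low = sequence_confidence(hesitant_response)

# A wrong answer (reward 0) is penalized more when the model was confident.
loss_wrong_confident = bce_calibration_loss(c_high, 0.0)
loss_wrong_hesitant = bce_calibration_loss(c_low, 0.0)
```

Under these assumptions, an incorrect response given with high confidence incurs a larger calibration penalty than an incorrect response given with low confidence, which is the behavior a calibration regularizer is meant to encourage.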

Paper Structure

This paper contains 29 sections, 1 theorem, 29 equations, 7 figures, 13 tables, 1 algorithm.

Key Result

Proposition 3.1

Given binary rewards $r_i \in \{0, 1\}$, the group mean reward $m \in (0, 1)$ and the model's confidence $c_{\theta, i} \in (0, 1)$, the policy optimization direction $r_i-m$ and the regularization direction $r_i-c_{\theta,i}$ are consistent, that is, they always share the same sign.
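
The claim follows from a two-case check on $r_i$, sketched here using only the assumptions stated in the proposition ($m \in (0,1)$ and $c_{\theta,i} \in (0,1)$):

$$
r_i = 1:\quad r_i - m = 1 - m > 0, \qquad r_i - c_{\theta,i} = 1 - c_{\theta,i} > 0,
$$

$$
r_i = 0:\quad r_i - m = -m < 0, \qquad r_i - c_{\theta,i} = -c_{\theta,i} < 0.
$$

In both cases the two quantities have the same sign. The open interval for $m$ matters: it means the group contains both correct and incorrect responses, so $r_i - m$ is never zero.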

Figures (7)

  • Figure 1: Comparison of various methods on the consistency between model confidences and rewards. Panel (a) presents reliability diagrams of various methods on six mathematical reasoning tasks, demonstrating our method's effective calibration. Panel (b) shows Expected Calibration Error (ECE) against Accuracy on the "Knights and Knaves" logic puzzle (K&K) dataset (Xie et al., 2025), where our method reaches the best performance quadrant (high accuracy and low ECE).
  • Figure 2: The normalization of the rewards in the K&K dataset (Xie et al., 2025).
  • Figure 3: Training dynamics on the K&K dataset. The right panel shows the evolution of Accuracy and ECE on the training set. The left panel shows the same metrics on the test set.
  • Figure 4: Calibration performance of various methods on the K&K dataset (test set). The top plot shows the Fraction of Positives against the model's predicted confidence, with the dashed line representing perfect calibration. The bottom plot is a confidence histogram showing the distribution of predicted confidences. Results are generated by sampling at a temperature of 1.0 from the entire probability distribution.
  • Figure 5: Training dynamics on the mathematical dataset. The right panel shows the evolution of Accuracy and ECE on the training set. The left panel shows the same metrics on the test set.
  • ...and 2 more figures
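
The Expected Calibration Error reported in these figures can be computed in several ways. The sketch below uses the common equal-width binning estimator with 10 bins; the bin count, the function name `expected_calibration_error`, and the example data are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted confidences in [0, 1]
    outcomes:    1 if the response was correct, 0 otherwise
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have equal length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")

    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes to the last bin
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical data: an overconfident model vs. a well-calibrated one.
ece_overconfident = expected_calibration_error([0.95] * 4, [1, 0, 0, 1])
ece_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```

In this toy data the overconfident model claims 95% confidence but is right half the time, so its ECE is about 0.45, while the model that claims 75% confidence and is right three times out of four has an ECE of 0.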

Theorems & Definitions (1)

  • Proposition 3.1