Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models

Liangjie Huang, Dawei Li, Huan Liu, Lu Cheng

TL;DR

Problem: how self-improvement in LLMs affects confidence estimation and reliability. Approach: compare basic prompting, Chain-of-Thought prompting, and supervised fine-tuning as self-improvement methods, and test three calibration strategies (calibrating after multiple rounds of self-improvement, calibrating before self-improvement, and calibrating iteratively at each step) on the MMLU benchmark, with Expected Calibration Error ($ECE$) as the calibration metric. Key findings: iterative self-improvement tends to increase overconfidence and $ECE$; applying calibration, especially iteratively at each step, substantially reduces $ECE$ and yields better-calibrated confidence estimates; calibrating before self-improvement also yields gains for stronger models, though results vary by model. Significance: provides a calibration-centric view of self-improving LLMs and guides practical design choices for balancing accuracy and reliability in real-world, high-stakes tasks.

Abstract

Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanism has shown promise in enhancing task performance, recent studies suggest that it may also introduce undesirable biases, most notably self-bias, the tendency of LLMs to favor their own prior outputs. In this work, we extend this line of inquiry by investigating the impact of self-improvement on confidence estimation. We evaluate three representative self-improvement paradigms: basic prompting, Chain-of-Thought (CoT) prompting, and tuning-based methods. We find that iterative self-improvement can lead to systematic overconfidence, as evidenced by a steadily increasing Expected Calibration Error (ECE) and by lower accuracy among high-confidence predictions. We then explore the integration of confidence calibration techniques with self-improvement. Specifically, we compare three strategies: (1) applying calibration after multiple rounds of self-improvement, (2) calibrating before self-improvement, and (3) applying calibration iteratively at each self-improvement step. Our results show that iterative calibration is the most effective of the three at reducing ECE. Our work pioneers the study of self-improving LLMs from a calibration perspective, offering valuable insights into balancing model performance and reliability.
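For readers unfamiliar with the metric, the following is a minimal sketch of the standard equal-width binned ECE, which averages the gap between accuracy and mean confidence within each confidence bin, weighted by bin size. The function name `expected_calibration_error`, the default of 10 bins, and the toy inputs are assumptions for illustration; this summary does not state the paper's exact binning scheme.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin fraction) * |accuracy - mean confidence|.

    Hypothetical helper for illustration; n_bins=10 is an assumed default.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Map each confidence in [0, 1] to one of n_bins equal-width bins.
    # Bins are (lo, hi]; a confidence of exactly 0 falls in the first bin.
    bin_ids = np.ceil(confidences * n_bins).astype(int) - 1
    bin_ids = np.clip(bin_ids, 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        accuracy = correct[mask].mean()
        avg_confidence = confidences[mask].mean()
        ece += mask.mean() * abs(accuracy - avg_confidence)
    return float(ece)


# Toy example: an overconfident model (95% stated confidence, 50% accuracy).
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 1, 0])
# Toy example: stated confidence close to observed accuracy in each bin.
well_calibrated = expected_calibration_error(
    [0.55, 0.55, 0.95, 0.95], [1, 0, 1, 1]
)
print(f"overconfident ECE:   {overconfident:.3f}")
print(f"well-calibrated ECE: {well_calibrated:.3f}")
```

A rising ECE across self-improvement rounds, as the abstract describes, corresponds to the first toy case: stated confidence drifting above observed accuracy.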

Paper Structure

This paper contains 13 sections, 8 equations, 5 figures.

Figures (5)

  • Figure 1: The two research questions and overview of our exploration process in this work.
  • Figure 2: Results of Self-Improvement in Different Methods.
  • Figure 3: Llama-deepSeek's accuracy and confidence distribution.
  • Figure 4: Llama's accuracy and confidence distribution.
  • Figure 5: Self-Improve and Calibration Relationship Experiment Result.