
Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

Zhe Yang, Yichang Zhang, Yudong Wang, Ziyao Xu, Junyang Lin, Zhifang Sui

TL;DR

This work studies self-correction in large language models (LLMs) by decomposing it into two capabilities: confidence (keeping answers that are already correct) and critique (fixing answers that are wrong). It introduces three probabilistic metrics, Confidence Level (CL), Critique Score (CS), and Relative Self-Correction Score (RSS), and proves that post-correction accuracy $Acc_2$ is a weighted sum of CL and CS, $Acc_2 = Acc_1 \cdot CL + (1-Acc_1) \cdot CS$, where $Acc_1$ is the accuracy before self-correction, with bounds that enable fair comparisons across models. In experiments across open- and closed-source models and diverse tasks, the authors observe a trade-off between CL and CS and show that prompts or in-context learning (ICL) alone cannot optimize both at once. The paper then proposes Confidence and Critique Improvement Tuning (CCT), a simple training strategy based on data transformation that combines CLT and CST in a unified fine-tuning regime. CCT outperforms vanilla SFT in $Acc_2$, CL, and CS and enables high post-correction accuracy, even when combined with SFT. Overall, the decomposition, the metrics, and CCT offer a practical framework for diagnosing and improving self-correction in LLMs, with implications for robust QA and reasoning tasks.
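The weighted-sum identity can be read as an instance of the law of total probability. The short derivation below is a sketch under assumed notation: the events $A$ (initial answer correct) and $B$ (answer after self-correction correct) are introduced here for illustration, and CL and CS are interpreted as the conditional probabilities that make the identity hold. The paper's own formal definitions may be phrased differently.

```latex
% Assumed notation (not from the source):
%   A = "initial answer is correct", B = "answer after self-correction is correct"
%   Acc_1 = P(A),  Acc_2 = P(B),  CL = P(B \mid A),  CS = P(B \mid \neg A)
\begin{align}
Acc_2 = P(B)
  &= P(A \cap B) + P(\neg A \cap B) \\
  &= P(A)\,P(B \mid A) + P(\neg A)\,P(B \mid \neg A) \\
  &= Acc_1 \cdot CL + (1 - Acc_1) \cdot CS .
\end{align}
```

Under this reading, $Acc_2$ always lies between CL and CS, because it is a convex combination of the two with weights $Acc_1$ and $1 - Acc_1$.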

Abstract

Large Language Models (LLMs) can correct their self-generated responses, but accuracy sometimes declines after self-correction. To understand self-correction more deeply, we decompose, evaluate, and analyze the self-correction behaviors of LLMs. By enumerating and analyzing answer correctness before and after self-correction, we decompose the self-correction capability into confidence (staying confident in correct answers) and critique (turning wrong answers into correct ones) capabilities, and propose two metrics from a probabilistic perspective to measure these two capabilities, along with a third metric for evaluating overall self-correction capability. Based on our decomposition and evaluation metrics, we conduct extensive experiments and draw several empirical conclusions. For example, we find that different models can exhibit distinct behaviors: some models are confident while others are more critical. We also find a trade-off between the two capabilities (i.e., improving one can lead to a decline in the other) when manipulating model self-correction behavior through prompts or in-context learning. Further, we find a simple yet efficient strategy to improve self-correction capability by transforming the Supervised Fine-Tuning (SFT) data format; our strategy outperforms vanilla SFT in both capabilities and achieves much higher accuracy after self-correction. Our code will be publicly available on GitHub.
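To make the enumeration of correctness before and after self-correction concrete, here is a minimal Python sketch. It assumes CL and CS are estimated as empirical conditional frequencies (the fraction of initially correct answers that stay correct, and the fraction of initially wrong answers that become correct). The function name `self_correction_metrics`, the dictionary keys, and the toy data are hypothetical and not taken from the paper's code.

```python
from typing import Dict, Sequence


def self_correction_metrics(before: Sequence[bool], after: Sequence[bool]) -> Dict[str, float]:
    """Estimate Acc_1, Acc_2, CL, and CS from paired correctness labels.

    before[i] / after[i] say whether answer i was correct before / after
    self-correction. CL and CS are computed as conditional frequencies,
    which is an assumed reading of the paper's probabilistic metrics.
    """
    if len(before) != len(after) or len(before) == 0:
        raise ValueError("before and after must be non-empty and equally long")

    n = len(before)
    n_correct_before = sum(before)
    n_wrong_before = n - n_correct_before

    # Correct answers that were kept correct (confidence).
    kept = sum(1 for b, a in zip(before, after) if b and a)
    # Wrong answers that were turned into correct ones (critique).
    fixed = sum(1 for b, a in zip(before, after) if not b and a)

    # Convention chosen for this sketch: a metric with an empty
    # conditioning set is reported as 0.0.
    cl = kept / n_correct_before if n_correct_before else 0.0
    cs = fixed / n_wrong_before if n_wrong_before else 0.0

    return {
        "acc1": n_correct_before / n,
        "acc2": sum(after) / n,
        "cl": cl,
        "cs": cs,
    }


# Toy data: 3 initially correct answers (one gets broken),
# 2 initially wrong answers (one gets fixed).
before = [True, True, True, False, False]
after = [True, True, False, True, False]
m = self_correction_metrics(before, after)

# Check the identity Acc_2 = Acc_1 * CL + (1 - Acc_1) * CS on the toy data.
reconstructed = m["acc1"] * m["cl"] + (1 - m["acc1"]) * m["cs"]
print(m, reconstructed)
```

On this toy data the model keeps two of three correct answers and fixes one of two wrong ones, so accuracy is unchanged even though individual answers flipped in both directions. This is the kind of behavior that a single accuracy number hides and the two-metric decomposition exposes.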

Paper Structure (40 sections, 14 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: An example of the four scenarios in self-correction. For a correct initial answer, the LLM can (1) confidently maintain it or (2) unconfidently change it into a wrong answer. For a wrong initial answer, the LLM can (3) critique and correct it or (4) stubbornly insist on the wrong answer.
  • Figure 2: Venn diagram for confident and critical models in the complete probability space. The red circle, the orange circle, and their overlap denote the probability that a model answers correctly before self-correction, after self-correction, and both, respectively. The overlap area of confident models is much larger than that of critical models.
  • Figure 3: Visualized expression of Relative Self-correction Score.
  • Figure 4: Relative Self-correction Score (RSS) results on GSM8k (shown in ascending order of $Acc_2$). Besides showing the RSS of each evaluated model as a bar, we also show $Acc_2$ and the upper and lower bounds of $Acc_2$ as lines of different colors for comparison.
  • Figure 5: A trade-off between CL and CS. A confidence prompt/ICL example can lead to higher CL and lower CS; a critique prompt/ICL example can lead to lower CL and higher CS.
  • ...and 5 more figures