CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?

Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi Wang, Zhaowei Wang, Chunkit Chan, Yangqiu Song

TL;DR

The paper tackles verbalized confidence calibration in LLMs for high-stakes scenarios. It introduces two critique-based approaches: Self-Critique, in which a model critiques and revises its own confidence, and CritiCal, a training method that uses teacher-generated natural language critiques to calibrate confidence expressions. Empirical results show that CritiCal outperforms Self-Critique and other baselines, even surpassing its teacher model, GPT-4o, on complex reasoning tasks, and that it generalizes well to out-of-distribution data. The work also delineates what to critique (uncertainty for open-ended tasks, confidence for multiple-choice tasks) and how to critique (SFT with critique supervision, with DPO as an alternative), offering a scalable path to more reliable verbalized confidence in AI systems.

Abstract

Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing the reliability of LLMs.

Paper Structure

This paper contains 20 sections, 12 figures, 3 tables.

Figures (12)

  • Figure 1: In-domain comparisons between CritiCal and other SFT methods using DeepSeek-R1-Distill-Qwen-7B on MATH-Perturb, showing CritiCal’s huge potential for improving LLMs' confidence calibration even when the teacher model itself is less well calibrated.
  • Figure 2: Comparisons between CritiCal and traditional confidence calibration methods.
  • Figure 3: Mean ECE and AUROC values for each model across benchmarks of the same category. Dark bars show results under the uncertainty prompt; light bars show results under the confidence prompt. Further analysis under the multi-turn Self-Critique setting can be found in the Self-Critique Results appendix.
  • Figure 4: Standard deviation of multi-turn Self-Critique for ECE and AUROC across three benchmarks. Each bar represents the standard deviation of a model's performance (uncertainty or confidence) across 6 iterations, where iteration 0 denotes the original response and iterations 1–5 indicate Self-Critique. Each benchmark is chosen to represent its task category, since questions of the same type are similar. Full results are in the Self-Critique Results appendix.
  • Figure 5: Multi-turn Self-Critique results on the ComparisonQA, StrategyQA, and MATH-Perturb benchmarks. Each plot shows the smoothed mean performance (solid line) and the corresponding 1/3 standard deviation range (shaded area) for ACC, ECE, and AUROC. Iteration 0 represents the original response without Self-Critique.
  • ...and 7 more figures
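
The figures above report ECE and AUROC, the two calibration metrics used throughout. For readers unfamiliar with them, here is a minimal sketch of how they are commonly computed. It is not the authors' evaluation code. It assumes confidences have been rescaled to [0, 1] and uses 10 equal-width bins for ECE, which is a common default and not a setting confirmed by the source. The function names are illustrative.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins; the paper's setting is not stated here
) -> float:
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as half (pairwise definition of AUROC)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration, not from the paper).
toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_correct = [True, True, True, False]
print(expected_calibration_error(toy_conf, toy_correct))
print(auroc(toy_conf, toy_correct))
```

Lower ECE means stated confidence tracks accuracy more closely. Higher AUROC means confidence separates correct from incorrect answers better. The two can move independently, which is presumably why the figures report both.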