
Calibrating the Confidence of Large Language Models by Eliciting Fidelity

Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, Xipeng Qiu

TL;DR

This work addresses the overconfidence of RLHF-tuned LLMs by decomposing model confidence into Uncertainty about the question and Fidelity to the chosen answer. It introduces UF Calibration, a plug-and-play method that estimates confidence without requiring per-token logits, using sampling and hierarchical fidelity chains to derive a final confidence score. The authors also propose two new calibration metrics, IPR and CE, and demonstrate robust calibration improvements across six RLHF-LMs on four MCQA benchmarks, along with analyses of truly well-calibrated confidence. The approach provides a strong baseline for eliciting model confidence and offers diagnostic tools for understanding the confidence distribution; the noted limitations are open-ended generation and query efficiency.

Abstract

Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless. However, post-alignment, these language models often exhibit overconfidence, where the expressed confidence does not accurately calibrate with their correctness rate. In this paper, we decompose the language model confidence into the \textit{Uncertainty} about the question and the \textit{Fidelity} to the answer generated by language models. Then, we propose a plug-and-play method to estimate the confidence of language models. Our method has shown good calibration performance by conducting experiments with 6 RLHF-LMs on four MCQA datasets. Moreover, we propose two novel metrics, IPR and CE, to evaluate the calibration of the model, and we have conducted a detailed discussion on \textit{Truly Well-Calibrated Confidence}. Our method could serve as a strong baseline, and we hope that this work will provide some insights into the model confidence calibration.

Paper Structure (31 sections, 10 equations, 13 figures, 13 tables, 1 algorithm)

Figures (13)

  • Figure 1: On four different MCQA datasets, our method is well calibrated, meaning its calibration curve lies sufficiently close to the $y=x$ line. The experimental data are derived from GPT-3.5-Turbo.
  • Figure 2: If the model's choice of answer changes after the content of its previously selected option is replaced with "All other options are wrong", the model's fidelity to its previous answer can be considered insufficient.
  • Figure 3: Our proposed UF Calibration requires at most two phases of model invocation. In the sampling phase, black-box models are sampled 10 times, as in the Sampled method, whereas a single invocation is sufficient for white-box models. In the fidelity-elicitation phase, the model is invoked approximately 2 to 3 times to generate a fidelity chain, as shown in Table \ref{table: averageLength}.
  • Figure 4: Our proposed method achieved well-calibrated results across all temperatures. The experimental results are derived from LLaMA2-13B-Chat. The results from Baichuan2-13B-Chat are presented in Appendix Figure \ref{fig:baichuan_temperature_scaling}.
  • Figure 5: The experimental results are derived from LLaMA2-Chat.
  • ...and 8 more figures
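
The two phases described in the captions for Figures 2 and 3 can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: `ask` is a hypothetical callable that sends a question and its options to a model and returns the chosen option label, the function names are invented for this sketch, and the stopping rule (end the chain once the model picks an option whose content has already been replaced) is an assumption read from the captions. The sketch deliberately stops short of combining the outputs into a confidence score, since the formula is not given in this section.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Replacement text quoted in the Figure 2 caption.
PLACEHOLDER = "All other options are wrong"

# Hypothetical model interface: (question, {label: text}) -> chosen label.
AskFn = Callable[[str, Dict[str, str]], str]


def sample_distribution(ask: AskFn, question: str, options: Dict[str, str],
                        n_samples: int = 10) -> Dict[str, float]:
    """Sampling phase (black-box case): query the model n_samples times
    and return the empirical frequency of each option label."""
    counts = Counter(ask(question, options) for _ in range(n_samples))
    return {label: counts[label] / n_samples for label in options}


def fidelity_chain(ask: AskFn, question: str, options: Dict[str, str],
                   max_calls: Optional[int] = None) -> List[str]:
    """Fidelity phase: after each answer, replace the chosen option's
    content with PLACEHOLDER and ask again. The chain records each newly
    chosen label and ends when the model keeps an already replaced option
    (assumed stopping rule) or the call budget runs out."""
    current = dict(options)  # copy so the caller's options stay intact
    chain: List[str] = []
    budget = max_calls if max_calls is not None else len(options) + 1
    for _ in range(budget):
        choice = ask(question, current)
        if choice not in current:
            raise ValueError(f"model returned unknown label: {choice!r}")
        if choice in chain:
            break  # the model stuck with a replaced option
        chain.append(choice)
        current[choice] = PLACEHOLDER
    return chain
```

Under these assumptions, a chain of length one means the model held on to its first answer even after that option's content was replaced, while a longer chain means the answer changed at least once, which Figure 2 interprets as lower fidelity.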