Calibrating the Confidence of Large Language Models by Eliciting Fidelity
Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, Xipeng Qiu
TL;DR
This work addresses the overconfidence of RLHF-tuned LLMs by decomposing model confidence into Uncertainty about the question and Fidelity to the chosen answer. It introduces UF Calibration, a plug-and-play method that estimates confidence without requiring per-token logits, using sampling and hierarchical fidelity chains to derive a final confidence score. The authors also propose two new calibration metrics, IPR and CE, and report robust calibration improvements across six RLHF-LMs on four MCQA benchmarks, along with an analysis of what truly well-calibrated confidence looks like. The approach provides a strong baseline for eliciting model confidence and offers diagnostic tools for understanding the confidence distribution; the noted limitations are open-ended generation and query efficiency.
Abstract
Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless. However, post-alignment, these language models often exhibit overconfidence: their expressed confidence is not well calibrated with their actual correctness rate. In this paper, we decompose language model confidence into the Uncertainty about the question and the Fidelity to the answer generated by the language model. We then propose a plug-and-play method to estimate the confidence of language models. In experiments with six RLHF-LMs on four MCQA datasets, our method shows good calibration performance. Moreover, we propose two novel metrics, IPR and CE, to evaluate model calibration, and we provide a detailed discussion of Truly Well-Calibrated Confidence. Our method could serve as a strong baseline, and we hope that this work will provide some insights into model confidence calibration.
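To make the Uncertainty/Fidelity decomposition more concrete, the following is a minimal Python sketch of how a confidence score might be assembled from sampled answers to a multiple-choice question. The text above does not give the formulas, so everything here is an assumption for illustration: uncertainty is approximated by the normalized entropy of the sampled answer distribution, the fidelity value is taken as an already-computed input, and the two are combined multiplicatively. The function names `normalized_entropy` and `combined_confidence` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def normalized_entropy(samples, num_options):
    """Entropy of the sampled answers, scaled to [0, 1] by log(num_options).

    Assumption: this stands in for the 'Uncertainty about the question'.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    if num_options < 2:
        return 0.0  # a single option leaves nothing to be uncertain about
    total = len(samples)
    counts = Counter(samples)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_options)


def combined_confidence(samples, num_options, fidelity):
    """Assumed combination rule: (1 - uncertainty) * fidelity.

    `fidelity` is a number in [0, 1] supplied by the caller; how it is
    elicited from the model is outside the scope of this sketch.
    """
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("fidelity must lie in [0, 1]")
    return (1.0 - normalized_entropy(samples, num_options)) * fidelity


# Example: ten samples on a four-option question, mostly agreeing on "B".
sampled = ["B"] * 8 + ["A", "C"]
confidence = combined_confidence(sampled, num_options=4, fidelity=0.9)
```

Under these assumptions, a model that always samples the same option has zero uncertainty, so its confidence equals its fidelity, while a model whose samples are spread evenly over all options gets a confidence near zero regardless of fidelity. Because only sampled answers are needed, a sketch of this kind works without access to per-token logits.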
