Multicalibration for Confidence Scoring in LLMs
Gianluca Detommaso, Martin Bertran, Riccardo Fogliato, Aaron Roth
TL;DR
This work tackles the challenge of trustworthy confidence scoring for LLM outputs by applying multicalibration, which enforces calibration not only overall but also across many intersecting groups of prompt/completion pairs. The authors develop and compare two grouping strategies (embedding-based clustering and self-annotation) and multicalibration algorithms, notably IGHB and variants such as IGLB that reduce its tendency to overfit, to produce calibrated probabilities that a completion is not a hallucination. Across several question answering datasets and LLMs, multicalibrated scores substantially improve calibration error and accuracy over traditional calibration, with IGLB and GCULR often performing best. The framework extends to new grouping methods and scoring functions, supporting more reliable and interpretable risk assessment in real-world deployments.
Abstract
This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs). Multicalibration asks for calibration not just marginally, but simultaneously across various intersecting groupings of the data. We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation", that is, asking the LLM various yes-or-no questions about the prompt. We also develop novel variants of multicalibration algorithms that offer performance improvements by reducing their tendency to overfit. Through systematic benchmarking across various question answering datasets and LLMs, we show how our techniques can yield confidence scores that provide substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.
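
To make the distinction between marginal calibration and multicalibration concrete, the following is a minimal sketch, not the paper's implementation. It measures a binned calibration error overall and then separately within each of several overlapping groups. The function names, the group names ("math", "trivia", "long_prompt"), and the toy data are all hypothetical and chosen only for illustration. In the paper's setting the groups would instead come from embedding clusters or self-annotation answers.

```python
from collections import defaultdict


def calibration_error(scores, labels, n_bins=10):
    """Binned calibration error: weighted mean of |avg score - avg label| per bin."""
    bins = defaultdict(list)
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # bin index for a score in [0, 1]
        bins[b].append((s, y))
    n = len(scores)
    err = 0.0
    for items in bins.values():
        mean_score = sum(s for s, _ in items) / len(items)
        mean_label = sum(y for _, y in items) / len(items)
        err += len(items) / n * abs(mean_score - mean_label)
    return err


def groupwise_calibration_error(scores, labels, groups, n_bins=10):
    """Calibration error restricted to each group.

    `groups` maps a (hypothetical) group name to a list of membership flags;
    groups may overlap, which is what "intersecting groupings" refers to.
    """
    result = {}
    for name, mask in groups.items():
        s = [x for x, m in zip(scores, mask) if m]
        y = [x for x, m in zip(labels, mask) if m]
        result[name] = calibration_error(s, y, n_bins) if s else 0.0
    return result


# Toy data (assumed): every completion gets confidence 0.5,
# and label 1 means the completion was correct.
scores = [0.5] * 8
labels = [1, 1, 1, 1, 0, 0, 0, 0]
groups = {
    "math": [True, True, True, True, False, False, False, False],
    "trivia": [False, False, False, False, True, True, True, True],
    "long_prompt": [True, True, False, False, True, True, False, False],
}

marginal = calibration_error(scores, labels)
per_group = groupwise_calibration_error(scores, labels, groups)
print(marginal)   # 0.0: calibrated on average over all examples
print(per_group)  # {'math': 0.5, 'trivia': 0.5, 'long_prompt': 0.0}
```

In this toy case the score is perfectly calibrated marginally, yet it is off by 0.5 on two of the three groups. A multicalibrated score would need a small error on every group at once, including overlapping ones.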
