Multicalibration for Confidence Scoring in LLMs

Gianluca Detommaso, Martin Bertran, Riccardo Fogliato, Aaron Roth

TL;DR

This work tackles the challenge of trustworthy confidence scoring for LLM outputs by introducing multicalibration, which enforces calibration not only overall but across many intersecting groups defined by prompts. The authors develop and compare grouping strategies (embedding-based clustering and self-annotation) and novel multicalibration algorithms, notably IGHB and its overfitting remedies (IGLB), to produce calibrated probabilities of non-hallucination. Across diverse QA datasets and several LLMs, multicalibrated scores deliver substantial improvements in calibration error and accuracy over traditional calibration, with IGLB and GCULR often achieving the best performance. The framework is extensible to new grouping methods and scoring functions, enabling more reliable, interpretable risk assessments in real-world deployments.

Abstract

This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs). Multicalibration asks for calibration not just marginally, but simultaneously across various intersecting groupings of the data. We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation" - querying the LLM by asking it various yes-or-no questions about the prompt. We also develop novel variants of multicalibration algorithms that offer performance improvements by reducing their tendency to overfit. Through systematic benchmarking across various question answering datasets and LLMs, we show how our techniques can yield confidence scores that provide substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.
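To make the distinction between marginal and group-wise calibration concrete, the following is a minimal sketch and not the paper's implementation. It assumes binary correctness labels, equal-width score bins, and groups given as boolean masks that may overlap. The function names `squared_calibration_error` and `groupwise_calibration_errors` are hypothetical, and the binned estimator is one common choice, not necessarily the one used in Definitions 2.2 and 2.8.

```python
import numpy as np


def squared_calibration_error(scores, labels, n_bins=10):
    """Bin-weighted average squared gap between mean score and accuracy.

    Assumption: scores lie in [0, 1] and are grouped into equal-width bins.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Map each score to a bin index; a score of exactly 1.0 goes in the last bin.
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    error = 0.0
    for b in np.unique(bins):
        mask = bins == b
        gap = scores[mask].mean() - labels[mask].mean()
        # Weight the squared gap by the fraction of points in this bin.
        error += mask.mean() * gap ** 2
    return error


def groupwise_calibration_errors(scores, labels, groups, n_bins=10):
    """Calibration error restricted to each (possibly overlapping) group.

    `groups` maps a group name to a boolean mask over the examples.
    Assumption: every group contains at least one example.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return {
        name: squared_calibration_error(scores[mask], labels[mask], n_bins)
        for name, mask in groups.items()
    }


# Toy data: a constant score of 0.5 is calibrated marginally,
# but badly miscalibrated inside each of the two groups.
scores = np.full(20, 0.5)
labels = np.array([1] * 9 + [0] + [1] + [0] * 9)
groups = {
    "group_a": np.arange(20) < 10,   # accuracy 0.9
    "group_b": np.arange(20) >= 10,  # accuracy 0.1
}

marginal_error = squared_calibration_error(scores, labels)
group_errors = groupwise_calibration_errors(scores, labels, groups)
```

In this toy example the marginal error is zero while each group has an error of about 0.16. This is the kind of gap that calibrating simultaneously across groups is meant to close.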

Paper Structure

This paper contains 30 sections, 5 theorems, 27 equations, 3 figures, 6 tables, 5 algorithms.

Key Result

Proposition 2.4

We have

Figures (3)

  • Figure 1: An application of multicalibration to question answering. Answers are colored from red to green according to their multicalibrated confidence scores of being a hallucination. Multicalibration is performed using Algorithm \ref{alg:iglb}.
  • Figure 2: Average scores against accuracies across various clusters, for each method, on MMLU for StableBeluga-13B. Colors represent the groups, and the size of each point reflects the size of its group. Multicalibration methods exhibit significantly superior alignment with the diagonal compared to standard calibration methods. In agreement with the results in Table \ref{tab:mean_std_results}, IGLB and GCULR stand out as the top performers.
  • Figure 3: Average scores against accuracies across various clusters, for each calibration method, and for inverse perplexity and multiple-choice softmax scores, on MMLU for StableBeluga-13B. Conclusions are similar to those derived for Figure \ref{fig:group_rel_plot}.

Theorems & Definitions (11)

  • Definition 2.1: Calibration
  • Definition 2.2: Average squared calibration error
  • Definition 2.3: Mean squared error
  • Proposition 2.4: See, e.g. kohavi1996bias
  • Theorem 2.5: See, e.g. roth2022uncertain
  • Definition 2.6: Group-conditional unbiasedness
  • Theorem 2.7
  • Definition 2.8: Group average squared calibration error
  • Definition 2.9: Multicalibration
  • Theorem 2.10
  • ...and 1 more