QA-Calibration of Language Model Confidence Scores

Putra Manggala, Atalanti Mastakouri, Elke Kirschbaum, Shiva Prasad Kasiviswanathan, Aaditya Ramdas

TL;DR

QA-calibration introduces a group-conditioned calibration target for QA confidence scores, addressing limitations of average-case calibration in heterogeneous domains. It defines a fixed β mapping to partition QA pairs and proposes two posthoc schemes, QA binning and scaling QA binning, to transform arbitrary elicited scores into discretized, QA-calibrated outputs with distribution-free guarantees. The β partitioning is instantiated via an embed-then-bin approach using a DistilBERT [CLS] embedding and a kd-tree, enabling semantically coherent QA groups; UMD-based per-partition calibration and hierarchical scaling provide robust performance and data-efficient calibration. Empirical results across five QA benchmarks and two LLMs show substantial improvements in QA-calibration error and selective answering, with theoretical guarantees that accommodate label misspecification and partition data sparsity, highlighting practical impact for decision-making in generative QA systems.
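The embed-then-bin idea can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`build_tree`, `assign`, `fit_umd`, `apply_umd`, `fit_qa_binning`, `calibrate`) are hypothetical, the toy one-dimensional "embeddings" stand in for DistilBERT [CLS] vectors, the kd-tree is a hand-written median-split tree, and the uniform-mass binning step is simplified (it assumes distinct scores and omits any tie-breaking or scaling step).

```python
from bisect import bisect_left
from statistics import median


def build_tree(points, depth=0, max_depth=2, label=1):
    """Median-split kd-tree; a leaf is an int id, an inner node is a tuple."""
    if depth == max_depth or len(points) < 2:
        return label
    axis = depth % len(points[0])
    cut = median(p[axis] for p in points)
    left = [p for p in points if p[axis] <= cut]
    right = [p for p in points if p[axis] > cut]
    if not left or not right:  # degenerate split: stop early
        return label
    return (axis, cut,
            build_tree(left, depth + 1, max_depth, 2 * label),
            build_tree(right, depth + 1, max_depth, 2 * label + 1))


def assign(tree, x):
    """Map an embedding to its partition id (the role played by beta)."""
    while isinstance(tree, tuple):
        axis, cut, left, right = tree
        tree = left if x[axis] <= cut else right
    return tree


def fit_umd(scores, labels, points_per_bin):
    """Simplified uniform-mass binning: equal-count bins, mean label per bin."""
    pairs = sorted(zip(scores, labels))
    n_bins = max(1, len(pairs) // points_per_bin)
    edges, values = [], []
    for k in range(n_bins):
        chunk = pairs[k * len(pairs) // n_bins:(k + 1) * len(pairs) // n_bins]
        values.append(sum(y for _, y in chunk) / len(chunk))
        if k < n_bins - 1:
            edges.append(chunk[-1][0])  # largest score in bin k
    return edges, values


def apply_umd(model, score):
    edges, values = model
    return values[bisect_left(edges, score)]


def fit_qa_binning(embeddings, scores, labels, max_depth=2, points_per_bin=2):
    """Partition the calibration set, then fit one binning model per partition."""
    tree = build_tree(embeddings, max_depth=max_depth)
    groups = {}
    for x, h, y in zip(embeddings, scores, labels):
        groups.setdefault(assign(tree, x), []).append((h, y))
    models = {s: fit_umd([h for h, _ in g], [y for _, y in g], points_per_bin)
              for s, g in groups.items()}
    return tree, models


def calibrate(fitted, embedding, score):
    tree, models = fitted
    return apply_umd(models[assign(tree, embedding)], score)


# Toy calibration set: two semantic regions with very different accuracy.
embeddings = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
scores = [0.5, 0.6, 0.7, 0.8, 0.55, 0.65, 0.75, 0.85]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
fitted = fit_qa_binning(embeddings, scores, labels, max_depth=1, points_per_bin=2)
```

In this toy setup the same raw score is mapped to different calibrated values depending on which partition the QA pair falls into, which is the behaviour a single global binning model cannot express.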

Abstract

To use generative question-and-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is, *on average*, indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce QA-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving QA-calibration. We establish distribution-free guarantees on the performance of this method and validate our method on confidence scores returned by elicitation prompts across multiple QA benchmarks and large language models (LLMs).
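A small numeric example may help show why average-case calibration can hide group-level miscalibration. The sketch below uses made-up data and hypothetical helper names (`calibration_error`, `group_calibration_error`); the group-conditioned error is one natural analogue of the average-case error and may differ in detail from the paper's formal definitions.

```python
from collections import defaultdict
from fractions import Fraction


def calibration_error(records):
    """Expected |P(correct | score) - score| over (score, correct) pairs."""
    by_score = defaultdict(list)
    for score, correct in records:
        by_score[score].append(correct)
    n = len(records)
    return sum(
        Fraction(len(ys), n) * abs(Fraction(sum(ys), len(ys)) - score)
        for score, ys in by_score.items()
    )


def group_calibration_error(records):
    """Size-weighted average of per-group errors over (group, score, correct)."""
    by_group = defaultdict(list)
    for group, score, correct in records:
        by_group[group].append((score, correct))
    n = len(records)
    return sum(
        Fraction(len(rs), n) * calibration_error(rs)
        for rs in by_group.values()
    )


# Made-up data: every answer gets confidence 1/2.
# Group A is right 3 times out of 4, group B only 1 time out of 4.
half = Fraction(1, 2)
data = (
    [("A", half, 1)] * 3 + [("A", half, 0)]
    + [("B", half, 1)] + [("B", half, 0)] * 3
)

average_error = calibration_error([(h, y) for _, h, y in data])
grouped_error = group_calibration_error(data)
```

Pooled over both groups, accuracy at confidence 1/2 is exactly 1/2, so `average_error` is zero, while `grouped_error` is 1/4 because each group on its own sees confidence scores that are off by a quarter.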

Paper Structure

This paper contains 24 sections, 3 theorems, 20 equations, 4 figures, 9 tables, 4 algorithms.

Key Result

Theorem 3.1

Consider an input calibration dataset $\tilde{D}$ defined above with misspecification factor $\nu$ from Definition \ref{dfn:misspecification}. Assume that the $h_i$'s are distinct, the number of points per bin satisfies $b \geq 2$, and the number of instances within each partition satisfies $n_s \geq b$ for every $s\in {\mathcal{S}}$ ...

Figures (4)

  • Figure 1: Two users interact separately with an LM by inputting questions and obtaining answers and confidence scores from the LM. The LM could be calibrated on average across user types, but each individual user may not have calibrated confidence scores. See Example \ref{ex:example_1} for more details.
  • Figure 2: The relationship between $\epsilon$, number of points per bin $b$ and misspecification constant $\nu$ in Theorem \ref{main}. Based on the plot, when $\nu=0$, practitioners should set $b\simeq 300$ when $N=1000$, $b\simeq 400$ when $N=5000$, $b\simeq 500$ when $N=20000$ (attaining $\epsilon=0.1$). When a ground truth proxy is misspecified (Definition \ref{dfn:misspecification}), e.g., $\nu=0.1$, for certain levels of $\epsilon$, the same bound can be attained with a larger $b$. For example, for achieving the same $\epsilon=0.15$, if $\nu=0$ then $b$ needs to be only approximately 250, whereas if $\nu=0.1$ then $b$ has to be $>1000$.
  • Figure 3: Scatter plot and regression lines for posthoc-calibrated scores vs confidence scores from Ling1S-Top1. Note that confidence scores from Ling1S-Top1 (shown on the x-axis) are discretized since the confidence statement is drawn from an expression list (Table \ref{tbl:prompts_used}). We use the OpenBookQA dataset and a Mistral LLM.
  • Figure 4: The reliability plot compares the baselines None, UMD (B), and our method, QA binning (QAB), across each QA partition. Using the OpenBookQA dataset, the Ling1S-Top1 prompt, a Mistral LLM, and a kd-tree with a maximum depth of two, four QA partitions are generated, and we conduct an analysis on each partition. UMD (B) and QA binning (QAB) provide confidence scores that are better calibrated in each partition than None. However, QA binning (QAB) yields better calibrated confidence scores than UMD (B).

Theorems & Definitions (13)

  • Definition 2.1: Calibration
  • Definition 2.2: Expected (Average-case) Calibration Error
  • Example 2.1
  • Definition 2.3: QA-calibration
  • Definition 2.4: QA-calibration error
  • Definition 3.1: Conditional QA-calibration
  • Definition 3.2
  • Theorem 3.1: Distribution-free QA-calibration Guarantee
  • Theorem A.1: Conditional Calibration Guarantee of Algorithm \ref{alg:UMD} under Label Misspecification
  • proof
  • ...and 3 more