QA-Calibration of Language Model Confidence Scores
Putra Manggala, Atalanti Mastakouri, Elke Kirschbaum, Shiva Prasad Kasiviswanathan, Aaditya Ramdas
TL;DR
QA-calibration is a group-conditioned calibration target for QA confidence scores that addresses the limitations of average-case calibration in heterogeneous domains. A fixed mapping β partitions question-and-answer pairs into groups, and two posthoc schemes, QA binning and scaling QA binning, transform arbitrary elicited scores into discretized, QA-calibrated outputs with distribution-free guarantees. The mapping β is instantiated with an embed-then-bin approach: a DistilBERT [CLS] embedding is partitioned with a kd-tree, which yields semantically coherent QA groups. Within each partition, calibration is UMD-based, and hierarchical scaling makes calibration more data-efficient. Empirical results across five QA benchmarks and two LLMs show substantial improvements in QA-calibration error and selective answering. The theoretical guarantees accommodate label misspecification and data sparsity within partitions, which supports the practical use of the method for decision-making in generative QA systems.
Abstract
To use generative question-answering (QA) systems for decision-making and in other critical applications, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is, *on average*, indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce QA-calibration, which ensures that calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving QA-calibration. We establish distribution-free guarantees on the performance of these schemes and validate them on confidence scores returned by elicitation prompts across multiple QA benchmarks and large language models (LLMs).
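The idea of calibrating within question-and-answer groups, rather than only on average, can be illustrated with a minimal sketch. It is not the authors' implementation: random vectors stand in for the question-and-answer embeddings, a simplified median-split tree stands in for the kd-tree, each group gets a plain uniform-mass histogram binning calibrator, and all function names (`fit_kd_splits`, `assign_cells`, `fit_binning`, `qa_binning_fit`, `qa_binning_predict`) are hypothetical.

```python
import numpy as np

def fit_kd_splits(emb, depth):
    """Learn median split thresholds; returns {(level, cell_id): threshold}."""
    splits = {}
    ids = np.zeros(len(emb), dtype=int)
    for level in range(depth):
        axis = level % emb.shape[1]  # cycle through coordinates, kd-tree style
        for cell in np.unique(ids):
            splits[(level, int(cell))] = float(np.median(emb[ids == cell, axis]))
        thresholds = np.array([splits[(level, int(c))] for c in ids])
        ids = 2 * ids + (emb[:, axis] > thresholds)
    return splits

def assign_cells(emb, splits, depth):
    """Map each embedding to a group id by replaying the learned splits."""
    ids = np.zeros(len(emb), dtype=int)
    for level in range(depth):
        axis = level % emb.shape[1]
        # Cells unseen during fitting fall back to threshold 0.0 (arbitrary choice).
        thresholds = np.array([splits.get((level, int(c)), 0.0) for c in ids])
        ids = 2 * ids + (emb[:, axis] > thresholds)
    return ids

def fit_binning(scores, labels, n_bins):
    """Uniform-mass histogram binning: bin edges and mean correctness per bin."""
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1)[1:-1])
    which = np.searchsorted(edges, scores, side="right")
    fallback = labels.mean()  # used only if a bin happens to be empty
    values = np.array([labels[which == b].mean() if np.any(which == b) else fallback
                       for b in range(n_bins)])
    return edges, values

def qa_binning_fit(emb, scores, labels, depth=2, n_bins=3):
    """Partition the calibration set, then fit one binning calibrator per group."""
    splits = fit_kd_splits(emb, depth)
    cells = assign_cells(emb, splits, depth)
    calibrators = {int(c): fit_binning(scores[cells == c], labels[cells == c], n_bins)
                   for c in np.unique(cells)}
    return splits, calibrators

def qa_binning_predict(emb, scores, splits, calibrators, depth=2, default=0.5):
    """Replace each raw score with the bin value of its own group."""
    cells = assign_cells(emb, splits, depth)
    out = np.full(len(scores), default)
    for i, (c, s) in enumerate(zip(cells, scores)):
        if int(c) in calibrators:
            edges, values = calibrators[int(c)]
            out[i] = values[np.searchsorted(edges, s, side="right")]
    return out

# Synthetic data (assumption): the raw score is informative only in half the space.
rng = np.random.default_rng(0)
emb = rng.normal(size=(400, 4))            # stand-in for QA embeddings
scores = rng.uniform(size=400)             # stand-in for elicited confidences
p_true = np.where(emb[:, 0] > 0, scores, 0.2)
labels = (rng.uniform(size=400) < p_true).astype(float)

splits, calibrators = qa_binning_fit(emb, scores, labels)
calibrated = qa_binning_predict(emb, scores, splits, calibrators)
cells = assign_cells(emb, splits, depth=2)
```

In this toy setup a single calibrator fitted on all 400 points would blend the informative and uninformative halves, whereas the per-group calibrators treat them separately. On the calibration data itself, the mean calibrated score in each group matches that group's empirical accuracy by construction; the guarantees described in the abstract concern behaviour on new data and are not demonstrated by this sketch.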
