Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

Yuhan Wang, Shiyu Ni, Zhikai Ding, Zihang Zhan, Yuanzi Li, Keping Bi

TL;DR

The paper studies confidence calibration for LLMs on multi-answer QA, where a question can have several correct answers. It introduces MACE, a 12,000-question benchmark spanning six domains with ground-truth answer counts in {1, 2, 4, 6}, and uses it to show how training-free methods become miscalibrated as answer cardinality grows. Across 15 methods and four model families, accuracy rises with the number of correct answers while estimated confidence falls, which causes severe miscalibration when questions with different answer counts are mixed; the effect is strongest for larger models. To address this, the paper proposes Semantic Confidence Aggregation (SCA), which aggregates token-level confidence across multiple high-probability sampled responses. SCA outperforms baselines on mixed-answer calibration while preserving performance on single-answer questions. The work extends calibration research beyond single-answer QA with a model-agnostic, efficient method aimed at more reliable decision-making in real-world AI systems.

Abstract

Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
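The failure mode described above can be made concrete with a small sketch. The first function is a plain agreement-rate estimate of the kind used by consistency-based methods. The second is only a hypothetical illustration of aggregating confidence over several high-probability responses: the abstract does not give SCA's formula, so `aggregated_confidence`, its `top_k` parameter, the use of exact string matching in place of semantic grouping, and all sample data are assumptions made for this example.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace; a crude stand-in for semantic grouping."""
    return " ".join(answer.lower().split())


def consistency_confidence(samples):
    """Return the most frequent answer and the fraction of samples agreeing with it."""
    counts = Counter(normalize(s) for s in samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)


def aggregated_confidence(scored_samples, top_k=4):
    """Hypothetical aggregation, NOT the paper's SCA definition.

    scored_samples: list of (answer, token_logprobs) pairs.
    Sums the sequence probabilities of the top_k distinct answers.
    """
    best = {}
    for answer, token_logprobs in scored_samples:
        prob = math.exp(sum(token_logprobs))  # sequence probability
        key = normalize(answer)
        best[key] = max(best.get(key, 0.0), prob)
    top = sorted(best.values(), reverse=True)[:top_k]
    return min(1.0, sum(top))


# Invented samples for illustration only.
# Single-answer question, model alternates between a right and a wrong answer.
uncertain_single = ["Paris"] * 5 + ["Lyon"] * 5
# Multi-answer question ("name a Pritzker Prize winner"), every sample is correct.
all_correct_multi = (
    ["Frank Gehry"] * 3 + ["Zaha Hadid"] * 3 + ["Tadao Ando"] * 2 + ["Rem Koolhaas"] * 2
)

print(consistency_confidence(uncertain_single)[1])   # 0.5
print(consistency_confidence(all_correct_multi)[1])  # 0.3

# Invented one-token log-probabilities for the multi-answer samples.
scored = [
    ("Frank Gehry", [math.log(0.30)]),
    ("Zaha Hadid", [math.log(0.25)]),
    ("Tadao Ando", [math.log(0.20)]),
    ("Rem Koolhaas", [math.log(0.15)]),
    ("Frank Gehry", [math.log(0.30)]),  # repeated sample, counted once
]
print(aggregated_confidence(scored))
```

In this toy data the all-correct multi-answer question gets a lower agreement score (0.3) than the genuinely uncertain single-answer question (0.5), which is the underestimation the abstract describes. Summing probability mass over distinct high-probability answers recovers a high score in the multi-answer case, under the assumptions stated above.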

Paper Structure (56 sections, 3 equations, 9 figures, 18 tables)

Figures (9)

  • Figure 1: Failure of consistency-based calibration on multi-answer questions. Disagreement among sampled correct answers yields low estimated confidence, indistinguishable from uncertainty caused by alternation between correct and incorrect answers in single-answer settings.
  • Figure 2: The MACE benchmark construction pipeline, illustrated using the Honorable Award domain. ①② Valid triplets (e.g., Pritzker Prize winners) are retained. ③ Low-popularity triplets (e.g., obscure awards like the Ig Nobel Prize) are removed via a Popularity Filter. ④ Noisy triplets containing invalid formats (e.g., URLs) are removed via a Validity Filter. ⑤ Factually incorrect triplets are identified and removed during Manual Verification. Finally, QA pairs are generated from the triplets under four ground-truth count settings (1a, 2a, 4a, 6a).
  • Figure 3: AUROC of LLaMA-3.1-70B-Instruct on mixed questions with varying counts of correct answers.
  • Figure 4: QA performance and confidence variation with the number of correct answers across model scales.
  • Figure 5: Knowledge coverage across ground-truth (GT) set sizes. Different color schemes denote different model families. Color intensity increases from light to dark to indicate increasing model size.
  • ...and 4 more figures
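
Figure 3 reports AUROC on mixed questions. As a rough illustration of why mixing answer counts can hurt this metric, the sketch below computes AUROC with the standard pairwise definition (the probability that a correct response receives higher confidence than an incorrect one, ties counted as half). The confidence values are invented for the example and are not taken from the paper; whether the paper computes AUROC in exactly this way is an assumption.

```python
def auroc(confidences, is_correct):
    """Pairwise AUROC: share of (correct, incorrect) pairs ranked in the right order."""
    pos = [c for c, y in zip(confidences, is_correct) if y]
    neg = [c for c, y in zip(confidences, is_correct) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Invented confidences. Each subset separates correct from incorrect perfectly,
# but correct multi-answer responses receive systematically lower confidence.
single_conf = [0.9, 0.8, 0.5, 0.4]
single_ok = [True, True, False, False]
multi_conf = [0.35, 0.3, 0.1]
multi_ok = [True, True, False]

print(auroc(single_conf, single_ok))  # 1.0
print(auroc(multi_conf, multi_ok))    # 1.0
print(auroc(single_conf + multi_conf, single_ok + multi_ok))
```

In this toy setup each subset is perfectly ranked on its own, yet the mixed set scores about 0.67, because underestimated confidence on correct multi-answer responses falls below the confidence of incorrect single-answer responses.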