Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

Yuhan Wang; Shiyu Ni; Zhikai Ding; Zihang Zhan; Yuanzi Li; Keping Bi

TL;DR

The paper studies confidence calibration for LLMs on multi-answer QA, where a question can have several correct answers. It introduces MACE, a 12,000-question benchmark spanning six domains with ground-truth answer counts in {1, 2, 4, 6}, and uses it to show how training-free calibration methods degrade as answer cardinality grows. Across 15 methods and four model families, accuracy rises with the number of correct answers while estimated confidence falls, so calibration breaks down on mixed sets of questions with different answer counts, especially for larger models. To address this, the paper proposes Semantic Confidence Aggregation (SCA), which aggregates token-level confidence across multiple high-probability sampled responses. SCA outperforms baselines on mixed-answer calibration while preserving performance on single-answer questions. The work extends calibration research beyond single-answer QA and offers a model-agnostic, efficient approach with practical implications for reliability and decision-making in real-world AI systems.

Abstract

Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
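The failure mode described in the abstract can be illustrated with a toy consistency-based estimator. This is a minimal sketch, not the paper's code or any of the 15 evaluated methods: the function name `consistency_confidence`, the exact-match normalization, and the sampled answers are all assumptions made for illustration. It scores confidence as the share of sampled answers that agree with the most frequent one.

```python
from collections import Counter


def consistency_confidence(samples):
    """Return (most frequent answer, fraction of samples agreeing with it).

    Hypothetical helper: real consistency-based methods typically cluster
    answers semantically; here we only lowercase and strip whitespace.
    """
    normalized = [s.strip().lower() for s in samples]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


# Single-answer question: one correct answer, samples mostly agree.
single = ["Paris"] * 9 + ["Lyon"]

# Multi-answer question (e.g. "Name a Pritzker Prize winner"):
# every sample is correct, but the samples disagree with each other.
multi = [
    "Zaha Hadid", "Frank Gehry", "Tadao Ando", "Zaha Hadid", "Rem Koolhaas",
    "Frank Gehry", "Tadao Ando", "Zaha Hadid", "Rem Koolhaas", "Frank Gehry",
]

print(consistency_confidence(single))  # ('paris', 0.9)
print(consistency_confidence(multi))   # ('zaha hadid', 0.3)
```

In this toy case all ten multi-answer samples are correct, yet the agreement score drops from 0.9 to 0.3, which is the kind of underestimation the abstract attributes to disagreement among equally correct responses.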

Paper Structure (56 sections, 3 equations, 9 figures, 18 tables)

Introduction
Related Work
Calibration in Large Language Models.
Datasets for Confidence Estimation.
The MACE Benchmark
Overview
Knowledge Collection
Domain Identification.
Triplet Collection.
Knowledge Filtering
Heuristic-Based Filtering.
Manual Verification.
QA Pair Generation
Construction Characteristics
Experimental Setup
...and 41 more sections

Figures (9)

Figure 1: Failure of consistency-based calibration on multi-answer questions. Disagreement among sampled correct answers yields low estimated confidence, indistinguishable from uncertainty caused by alternation between correct and incorrect answers in single-answer settings.
Figure 2: The MACE benchmark construction pipeline, illustrated using the Honorable Award domain. ①② Valid triplets (e.g., Pritzker Prize winners) are retained. ③ Low-popularity triplets (e.g., obscure awards like the Ig Nobel Prize) are removed via a Popularity Filter. ④ Noisy triplets containing invalid formats (e.g., URLs) are removed via a Validity Filter. ⑤ Factually incorrect triplets are identified and removed during Manual Verification. Finally, QA pairs are generated from the triplets under four ground-truth count settings (1a, 2a, 4a, 6a).
Figure 3: AUROC of LLaMA-3.1-70B-Instruct on mixed questions with varying counts of correct answers.
Figure 4: QA performance and confidence variation with the number of correct answers across model scales.
Figure 5: Knowledge coverage across ground-truth (GT) set sizes. Different color schemes denote different model families. Color intensity increases from light to dark to indicate increasing model size.
...and 4 more figures
