MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty

Yongjin Yang, Haneul Yoo, Hwaran Lee

TL;DR

This work introduces MAQA, a 2,042-question benchmark for evaluating uncertainty quantification (UQ) in the presence of data uncertainty, here meaning questions with multiple valid answers, across world knowledge, mathematical reasoning, and commonsense reasoning. It assesses five UQ methods on white-box and black-box LLMs and finds that data uncertainty challenges traditional logit- and response-based uncertainty signals, though entropy and response consistency remain effective in various settings. The study suggests that UQ benefits from separating data uncertainty from model uncertainty in a task-specific manner, and that multi-answer questions can lower AUROC relative to single-answer questions, depending on the task and model. The findings offer practical guidance for building more reliable UQ methods in realistic multi-answer contexts and point to future work on using the probabilistic outputs of LLMs to disentangle the two sources of uncertainty.

Abstract

Despite the massive advancements in large language models (LLMs), they still produce plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct or not. However, most uncertainty quantification methods have been evaluated on single-labeled questions, which removes data uncertainty: the irreducible randomness often present in user queries, which can arise from factors like multiple possible answers. This limitation may cause uncertainty quantification results to be unreliable in practical settings. In this paper, we investigate previous uncertainty quantification methods in the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks to evaluate uncertainty quantification regarding data uncertainty, and 2) assessing 5 uncertainty quantification methods on diverse white- and black-box LLMs. Our findings show that previous methods struggle more than in single-answer settings, though this varies depending on the task. Moreover, we observe that entropy- and consistency-based methods effectively estimate model uncertainty, even in the presence of data uncertainty. We believe these observations will guide future work on uncertainty quantification in more realistic settings.
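
The abstract frames uncertainty quantification as predicting whether a response is correct, and the results are reported as AUROC. As a minimal sketch of that evaluation, assuming each response comes with one confidence score and a binary correctness label, AUROC can be computed as the fraction of (correct, incorrect) pairs in which the correct response receives the higher score. The function name `auroc` and the toy inputs are illustrative and are not taken from the paper.

```python
def auroc(scores, labels):
    """Pairwise AUROC: chance that a correct response outscores an incorrect one.

    scores: confidence values, higher meaning "more likely correct".
    labels: 1 for a correct response, 0 for an incorrect one.
    Ties between a correct and an incorrect response count as half a win.
    """
    correct = [s for s, y in zip(scores, labels) if y == 1]
    incorrect = [s for s, y in zip(scores, labels) if y == 0]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect response")

    wins = 0.0
    for c in correct:
        for i in incorrect:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(correct) * len(incorrect))


if __name__ == "__main__":
    # Hypothetical confidence scores and correctness labels.
    confidences = [0.9, 0.2, 0.8, 0.3]
    is_correct = [1, 1, 0, 0]
    print(auroc(confidences, is_correct))
```

An AUROC of 0.5 means the score carries no information about correctness, while 1.0 means it ranks every correct response above every incorrect one. The quadratic loop is kept for readability; a rank-based implementation would be preferable for large evaluation sets.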

Paper Structure (57 sections, 1 equation, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Evaluation settings with and without data uncertainty. When asking for a single label set, the probability distribution can be used to estimate the model uncertainty. On the other hand, when evaluating a question that has multiple answers, it may become difficult to distinguish between model uncertainty and data uncertainty, due to the existence of multiple possible answers.
  • Figure 2: (a) Maximum probability of the correct answer by number of answers, evaluated on the world knowledge part of MAQA. The number of answers clearly affects the probability value, indicating data uncertainty. (b) Sum of the top 5 probabilities of the correct answer for each number of answers, which appears constant across answer counts. The results are averaged over three 7-8B models. (c) Maximum probability of the correct answer at each answer position, evaluated on reasoning tasks. LLMs tend to be overconfident, especially after the first answer.
  • Figure 3: (a) Maximum probability of the correct answer at each answer position under different prompting methods. CoT prompting clearly increases the confidence score. (b) Histogram of verbalized confidence values, evaluated on MAQA. LLMs tend to be overconfident, as confidence scores are concentrated in the range of 80-100. (c) Response consistency by answer count. The score is averaged over all datasets. Llama-3-8b is used for all three results.
  • Figure 4: AUROC scores by precision for MAQA using uncertainty quantification methods: (a) Max Softmax Logit, (b) Entropy, (c) Margin.
  • Figure 5: AUROC scores by precision for MAQA using uncertainty quantification methods: Verbalized Confidence and Response Consistency.
  • ...and 1 more figure
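
The figure captions above name three logit-based scores (Max Softmax Logit, Entropy, Margin) and one sampling-based score (Response Consistency). The sketch below shows one common way to compute such scores from a next-token probability distribution and from a set of sampled responses. It is an illustration under stated assumptions, not the paper's implementation: the exact definitions, the string normalization in `response_consistency`, and the toy distributions are all hypothetical.

```python
import math
from collections import Counter


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_prob(probs):
    """Largest probability; higher suggests more confidence."""
    return max(probs)


def entropy(probs):
    """Shannon entropy in nats; higher suggests more uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin(probs):
    """Gap between the two most likely candidates (needs at least two)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def top_k_sum(probs, k):
    """Total probability mass on the k most likely candidates."""
    return sum(sorted(probs, reverse=True)[:k])


def response_consistency(responses):
    """Share of sampled responses that match the most frequent one.

    Assumption: responses are compared after lowercasing and stripping
    whitespace; a real pipeline may need semantic matching instead.
    """
    counts = Counter(r.strip().lower() for r in responses)
    return counts.most_common(1)[0][1] / len(responses)


if __name__ == "__main__":
    # Hypothetical distributions over candidate answers.
    single_answer = [0.90, 0.05, 0.05]  # one valid answer
    multi_answer = [0.45, 0.45, 0.10]   # two equally valid answers

    for name, probs in [("single", single_answer), ("multi", multi_answer)]:
        print(name, max_prob(probs), entropy(probs), margin(probs),
              top_k_sum(probs, 2))

    print(response_consistency(["Paris", "paris ", "Lyon"]))
```

In the toy multi-answer case the model puts the same total mass on valid answers as in the single-answer case, yet the maximum probability and the margin both drop because that mass is split between two answers. This mirrors the pattern described for Figure 2, where the maximum probability varies with the number of answers while the summed top probabilities stay roughly constant.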