Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond

Zhiyuan Wang, Jinhao Duan, Chenxi Yuan, Qingyu Chen, Tianlong Chen, Yue Zhang, Ren Wang, Xiaoshuang Shi, Kaidi Xu

TL;DR

This paper introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at both the word and sequence levels according to semantic relevance, so that the resulting uncertainty quantification (UQ) aligns more closely with the reliability of LLMs.

Abstract

Uncertainty estimation is crucial for the reliability of safety-critical human and artificial intelligence (AI) interaction systems, particularly in the domain of healthcare engineering. However, a robust and general uncertainty measure for free-form answers has not been well established in open-ended medical question-answering (QA) tasks, where generative inequality introduces a large number of irrelevant words and sequences into the generated set used for uncertainty quantification (UQ), which can bias the estimate. This paper introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at both the word and sequence levels according to semantic relevance, so that the resulting uncertainty is more closely aligned with the reliability of large language models (LLMs). We compare WSE with six baseline methods on five free-form medical QA datasets, utilizing seven popular LLMs. Experimental results demonstrate that WSE exhibits superior performance in UQ under two standard criteria for correctness evaluation. Additionally, in real-world medical QA applications, the performance of LLMs is significantly enhanced (e.g., a 6.36% improvement in model accuracy on the COVID-QA dataset) by employing the responses that WSE identifies as having lower uncertainty as final answers, without any additional task-specific fine-tuning or architectural modifications.

Paper Structure

This paper contains 27 sections, 18 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: The overview of WSE and its potential for improving model accuracy. Given a medical query, the language model returns its most likely generation as the output, which might be incorrect. Following prior work, we additionally generate multiple (e.g., five) candidate responses to evaluate the trustworthiness of this output. Existing entropy-based measures identify high overall uncertainty in the candidate set, causing the API to refuse to return the most likely generation. By assessing semantic relevance at both the word and sequence levels, WSE highlights keywords and reliable sequences, resulting in calibrated uncertainty that meets the response criterion. Finally, we employ the response with the lowest uncertainty as the final output, which coincides with the reference answer.
  • Figure 2: Distribution of semantic relevance scores at both the word and sequence levels. The entire generated set contains a considerable proportion of irrelevant words and sequences (i.e., generative inequality).
  • Figure 3: Correlation between semantic relevance and uncertainty proportion at both the word and sequence levels. In general, irrelevant words and sequences are the primary source of uncertainty within the generated set of responses.
  • Figure 4: The performance of $\textit{WSE}_W$, $\textit{WSE}_S$, $\textit{WSE}_C$, and five baseline methods at different thresholds of RS. Results are obtained on the COVID-QA dataset utilizing the LLaMA-2-7B-Chat model.
  • Figure 5: The performance of $\textit{WSE}_W$, $\textit{WSE}_S$, $\textit{WSE}_C$, and five baseline methods at different thresholds of SS. Results are obtained on the COVID-QA dataset utilizing the LLaMA-2-7B-Chat model.
  • ...and 2 more figures
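
The selection step described in the Figure 1 caption (generate several candidate responses, score each one, keep the one with the lowest uncertainty) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Candidate` type, the `token_relevance` weights, and the `weighted_nll` score are hypothetical stand-ins introduced here, and the score is a simple relevance-weighted negative log-likelihood, not the WSE formula defined in the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled response (hypothetical container, not from the paper)."""
    text: str
    token_logprobs: list[float]   # log p(token | prefix), one entry per token
    token_relevance: list[float]  # assumed relevance weight in [0, 1] per token


def weighted_nll(c: Candidate) -> float:
    """Relevance-weighted mean negative log-likelihood.

    Illustrative stand-in for a relevance-calibrated uncertainty score:
    tokens with low relevance contribute little to the total.
    """
    total_weight = sum(c.token_relevance)
    if total_weight == 0:
        # No relevant tokens at all: treat the response as maximally uncertain.
        return float("inf")
    weighted = sum(w * lp for w, lp in zip(c.token_relevance, c.token_logprobs))
    return -weighted / total_weight


def select_lowest_uncertainty(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the smallest uncertainty score."""
    return min(candidates, key=weighted_nll)


# Toy example: candidate A has one confident keyword and one low-probability
# filler token that is judged irrelevant; candidate B is uniformly unsure.
a = Candidate("A", [math.log(0.9), math.log(0.1)], [1.0, 0.0])
b = Candidate("B", [math.log(0.5), math.log(0.5)], [1.0, 1.0])
best = select_lowest_uncertainty([a, b])
```

In this toy case an unweighted average would penalize candidate A for its irrelevant low-probability token, whereas the weighted score lets the keyword dominate, which is the kind of effect the caption attributes to assessing semantic relevance.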