Uncertainty Estimation of Large Language Models in Medical Question Answering
Jiaxin Wu, Yizhou Yu, Hong-Yu Zhou
TL;DR
This work addresses uncertainty estimation (UE) for large language models in medical question answering, where hallucinations pose critical safety risks. It evaluates existing UE approaches (entropy-based, self-assessment, external tools) and introduces Two-phase Verification, a probability-free pipeline that uses step-wise explanations, verification questions, and cross-checks to detect inconsistencies. Across three biomedical datasets and two model sizes, Two-phase Verification yields the strongest and most stable uncertainty signals, and its performance improves as model size increases. The approach holds promise for safer deployment of LLMs in high-stakes medical contexts, though the work also identifies limitations in verification question generation and domain knowledge integration that warrant further research.
Abstract
Large Language Models (LLMs) show promise for natural language generation in healthcare, but they risk hallucinating factually incorrect information. Deploying LLMs for medical question answering therefore requires reliable uncertainty estimation (UE) methods to detect hallucinations. In this work, we benchmark popular UE methods with different model sizes on medical question-answering datasets. Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications. We also observe that larger models tend to yield better results, suggesting a correlation between model size and the reliability of UE. To address these challenges, we propose Two-phase Verification, a probability-free UE approach. First, an LLM generates a step-by-step explanation alongside its initial answer and then formulates verification questions to check the factual claims in the explanation. The model then answers these questions twice: first independently, and then with reference to the explanation. Inconsistencies between the two sets of answers serve as a measure of the uncertainty in the original response. We evaluate our approach on three biomedical question-answering datasets using Llama 2 Chat models and compare it against the benchmarked baseline methods. The results show that our Two-phase Verification method achieves the best overall accuracy and stability across datasets and model sizes, and that its performance improves as model size increases.
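The Two-phase Verification steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `llm` callable, the prompt wording, the `same` answer comparator, and the choice to score uncertainty as the fraction of inconsistent answer pairs are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def _exact_match(a: str, b: str) -> bool:
    # Assumed comparator: case-insensitive string equality. A real system
    # would likely need a semantic consistency check instead.
    return a.strip().lower() == b.strip().lower()


def two_phase_verification(
    question: str,
    llm: LLM,
    same: Callable[[str, str], bool] = _exact_match,
) -> float:
    """Return an uncertainty score in [0, 1]; higher means less consistent."""
    # Step 1: initial answer together with a step-by-step explanation.
    explanation = llm(
        f"Answer the question and explain step by step.\nQuestion: {question}"
    )

    # Step 2: verification questions targeting the claims in the explanation.
    raw = llm(
        "Write one verification question per line for each factual claim in:\n"
        f"{explanation}"
    )
    questions = [q.strip() for q in raw.splitlines() if q.strip()]
    if not questions:
        # Assumption: with nothing to verify, treat the answer as maximally uncertain.
        return 1.0

    # Phase 1: answer each verification question independently.
    independent = [llm(f"Question: {q}") for q in questions]

    # Phase 2: answer the same questions with the explanation as reference.
    referenced = [llm(f"Context: {explanation}\nQuestion: {q}") for q in questions]

    # Assumed aggregation: share of question pairs whose answers disagree.
    inconsistent = sum(not same(a, b) for a, b in zip(independent, referenced))
    return inconsistent / len(questions)
```

Because the score depends only on generated text, the sketch needs no access to token probabilities, which is what "probability-free" refers to. In practice `llm` would wrap a chat model such as Llama 2 Chat.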
