MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth Sastry
TL;DR
This work introduces MedHalu, the first benchmark for studying hallucinations in LLMs during real-world healthcare queries, along with MedHaluDetect, a framework to evaluate hallucination detection across LLMs, medical experts, and laypeople. The dataset amalgamates HealthQA, LiveQA, and MedicationQA and includes fine-grained labeling of hallucination types (input-conflicting, context-conflicting, fact-conflicting) and text spans, with expert annotations confirming reliability (Cohen’s Kappa = 0.83). Key findings show that medical experts outperform laypeople, while LLMs generally underperform humans in detection, prompting an expert-in-the-loop approach that injects expert reasoning into prompts and yields notable improvements (e.g., GPT-4 macro-F1 gains of around 6.3 percentage points). The work also benchmarks cross-model and cross-group capabilities, releases a substantial dataset, and outlines future directions such as integrating knowledge graphs and extending to multilingual/multimodal settings to enhance safety and reliability in healthcare AI systems.
Abstract
Large language models (LLMs) are starting to complement traditional information-seeking mechanisms such as web search. LLM-powered chatbots like ChatGPT are gaining prominence among the general public. AI chatbots are also increasingly producing content on social media platforms. However, LLMs are also prone to hallucinations, generating plausible yet factually incorrect or fabricated information. This becomes a critical problem when laypeople start seeking information about sensitive issues such as healthcare. Existing work on LLM hallucinations in the medical domain mainly focuses on testing the medical knowledge of LLMs through standardized medical exam questions, which are often well-defined and clear-cut with definitive answers. However, these approaches may not fully capture how these LLMs perform during real-world interactions with patients. This work conducts a pioneering study on hallucinations in LLM-generated responses to real-world healthcare queries from patients. We introduce MedHalu, a novel medical hallucination benchmark featuring diverse health-related topics and hallucinated responses from LLMs, with detailed annotation of the hallucination types and text spans. We also propose MedHaluDetect, a comprehensive framework for evaluating LLMs' abilities to detect hallucinations. Furthermore, we study the vulnerability to medical hallucinations among three groups: medical experts, LLMs, and laypeople. Notably, LLMs significantly underperform human experts and, in some cases, even laypeople in detecting medical hallucinations. To improve hallucination detection, we propose an expert-in-the-loop approach that integrates expert reasoning into LLM inputs, significantly improving hallucination detection for all LLMs, including a 6.3% macro-F1 improvement for GPT-4. Our code and dataset are available at https://netsys.surrey.ac.uk/datasets/medhalu/.
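
Since the headline result is reported in macro-F1, the following is a minimal sketch of how that metric is computed for binary hallucination detection. It is not the authors' evaluation code: the function names `f1_for_class` and `macro_f1`, the label encoding (1 = hallucinated, 0 = not hallucinated), and the toy labels are all assumptions made for illustration.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # By convention, return 0.0 when the class never occurs in labels or predictions.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores over all observed classes."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute macro-F1 on empty input")
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy labels (assumed encoding: 1 = hallucinated, 0 = not hallucinated).
gold = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
score = macro_f1(gold, predicted)
print(f"macro-F1 = {score:.4f}")
```

Because macro-F1 averages the per-class scores without weighting by class frequency, a detector that does well on one class only cannot score highly overall.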
