Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation
Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Reem E. Mohamed, Md Rafiqul Islam, Yakub Sebastian, Mukhtar Hussain, Sami Azam
TL;DR
The paper addresses hallucinations in healthcare LLM outputs by introducing an independent, LLM-free fact-checking module and a domain-adapted summarization model trained on MIMIC-III. It combines LoRA-based fine-tuning of LLaMA-3.1-8B with proposition-level verification against structured EHR data, achieving high precision (0.8904) and strong overall F1 (0.8556) on 3,786 propositions from 104 discharge summaries, while maintaining competitive summary quality (ROUGE/BERTScore). The approach emphasizes transparency, reproducibility, and safety for clinical decision support, and demonstrates significant improvements over existing baselines in factual grounding. Future work includes expanding domain coverage with richer ontologies and causal reasoning, and incorporating clinician-in-the-loop mechanisms for repair suggestions and escalation.
Abstract
In healthcare, LLM-generated output must be reliable and accurate, particularly where it informs decision-making and patient safety. In practice, such output is often unreliable in these critical settings because LLMs can hallucinate. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and granular logical checks, implemented through discrete logic in natural language processing (NLP), to validate facts against electronic health records (EHRs). We trained the LLM on the full MIMIC-III dataset. To evaluate the fact-checking module, we sampled 104 summaries and decomposed them into 3,786 propositions, which we used as the facts to verify. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summarization model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
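As an illustration of what a proposition-level numerical test against structured EHR data could look like, the following is a minimal Python sketch. It is not the module described above: the record layout, the field names, the `check_numeric` helper, and the 5% tolerance are all assumptions made for the example. The `f1` helper only confirms that the reported F1-score follows from the reported precision and recall.

```python
import re

# Hypothetical structured EHR record; field names and values are illustrative only.
EHR_RECORD = {"heart_rate": 88.0, "creatinine": 1.4}


def check_numeric(proposition: str, field: str, record: dict, tol: float = 0.05) -> bool:
    """Return True if the first number in the proposition matches record[field]
    within a relative tolerance (assumed here to be 5%)."""
    match = re.search(r"-?\d+(?:\.\d+)?", proposition)
    if match is None or field not in record:
        # No number to test, or no EHR value to test against: not supported.
        return False
    claimed = float(match.group())
    actual = record[field]
    return abs(claimed - actual) <= tol * max(abs(actual), 1e-9)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


supported = check_numeric("Heart rate was 88 bpm", "heart_rate", EHR_RECORD)
contradicted = check_numeric("Creatinine was 2.9 mg/dL", "creatinine", EHR_RECORD)
reported_f1 = round(f1(0.8904, 0.8234), 4)
```

Under these assumptions, the first proposition is accepted, the second is rejected, and `reported_f1` agrees with the F1-score stated in the abstract. A real checker would also need to map each proposition to the correct EHR field and handle units, which this sketch leaves out.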
