Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Reem E. Mohamed, Md Rafiqul Islam, Yakub Sebastian, Mukhtar Hussain, Sami Azam

TL;DR

The paper addresses hallucinations in healthcare LLM outputs by introducing an independent, LLM-free fact-checking module and a domain-adapted summarization model trained on MIMIC-III. It combines LoRA-based fine-tuning of LLaMA-3.1-8B with proposition-level verification against structured EHR data, achieving high precision (0.8904) and strong overall F1 (0.8556) on 3,786 propositions from 104 discharge summaries, while maintaining competitive summary quality (ROUGE/BERTScore). The approach emphasizes transparency, reproducibility, and safety for clinical decision support, and demonstrates significant improvements over existing baselines in factual grounding. Future work includes expanding domain coverage with richer ontologies and causal reasoning, and incorporating clinician-in-the-loop mechanisms for repair suggestions and escalation.

Abstract

In healthcare, LLM-generated output must be reliable and accurate, particularly where decision-making and patient safety are involved. In such critical settings, however, outputs are often unreliable because LLMs can hallucinate. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which applies numerical correctness tests and granular logical checks, implemented as discrete logic over natural language processing (NLP) output, to validate facts against electronic health records (EHRs). We trained the LLM on the full MIMIC-III dataset. To evaluate the fact-checking module, we sampled 104 summaries, decomposed them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summarization model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
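As a quick consistency check on the reported numbers, the F1-score can be recomputed from the stated precision and recall. This assumes the standard harmonic-mean definition of F1, which the abstract does not spell out; the symbols P and R are introduced here only for the calculation.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.8904 \times 0.8234}{0.8904 + 0.8234}
    = \frac{1.4663}{1.7138}
    \approx 0.8556
```

The result matches the reported F1-score of 0.8556 to four decimal places.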

Paper Structure

This paper contains 41 sections, 15 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison between traditional LLM-only summarization in healthcare and our proposed LLM integrated with the fact-checking module. The left side shows that a conventional LLM may produce hallucinated outputs that are factually incorrect (e.g., an incorrect medication dosage). These errors can lead to risky medical decisions. The right side illustrates the benefit of using a fine-tuned, domain-specific LLM together with the fact-checking module, which verifies each claim against the EHR data.
  • Figure 2: The workflow illustrates how LoRA fine-tuning is applied to a large language model (LLM) to generate medical summaries from patient EHRs. The generated summaries and EHR source records are decomposed into structured propositions, which are then compared using consistency checkers to evaluate factual alignment. Contradictions and unsupported claims are identified through logical consistency rules, which then yield a verdict for each proposition (Supported or Not Supported).
  • Figure 3: Illustrative examples of proposition-level factual verification outcomes. Each row shows a proposition drawn from a summary, its EHR reference statement, and the factual verdict assigned by the fact-checking module. The figure shows a few of the consistency checks that the verification engine performs, such as numerical, presence, temporal, negation, implication, and mutual exclusivity checks. Propositions failing one or more checks are marked as Not Supported, and those fully aligned and validated against the EHR are labeled as Supported. These examples highlight the deterministic operation of the multi-layered fact-checking pipeline.
  • Figure 4: End-to-end fine-tuning pipeline illustrating preprocessing, tokenization, LoRA-based adaptation, and evaluation integration.
  • Figure 5: Training loss convergence over fine-tuning steps.
  • ...and 1 more figures
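
The captions for Figures 2 and 3 describe a deterministic, rule-based pass in which each summary proposition is checked against an EHR reference and labeled Supported or Not Supported. The sketch below illustrates that idea for three of the named checks (presence, negation, numerical). It is not the authors' implementation: the `Proposition` fields, the function names, the exact-match tolerance, and the example record are all assumptions made for illustration, and the temporal, implication, and mutual exclusivity checks are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Proposition:
    """Hypothetical structured claim; the paper's actual schema is not given here."""
    entity: str                      # what the claim is about
    value: Optional[float] = None    # numeric value, if the claim has one
    unit: Optional[str] = None       # unit of the numeric value
    negated: bool = False            # True for claims such as "no evidence of X"


def check_negation(claim: Proposition, ref: Proposition) -> bool:
    # The claim and the EHR reference must agree on polarity.
    return claim.negated == ref.negated


def check_numeric(claim: Proposition, ref: Proposition, rel_tol: float = 0.0) -> bool:
    # Claims without a number pass trivially; otherwise units and values must match.
    if claim.value is None:
        return True
    if ref.value is None or claim.unit != ref.unit:
        return False
    return abs(claim.value - ref.value) <= rel_tol * abs(ref.value)


def verify(claim: Proposition, ehr: Dict[str, Proposition], rel_tol: float = 0.0) -> str:
    # Presence check: a claim with no EHR counterpart is unsupported.
    ref = ehr.get(claim.entity)
    if ref is None:
        return "Not Supported"
    # A proposition is Supported only if every applicable check passes.
    passed = check_negation(claim, ref) and check_numeric(claim, ref, rel_tol)
    return "Supported" if passed else "Not Supported"


# Made-up EHR reference facts, for illustration only (not taken from MIMIC-III).
EXAMPLE_EHR = {
    "metoprolol dose": Proposition("metoprolol dose", 25.0, "mg"),
    "pneumonia": Proposition("pneumonia", negated=True),
}

if __name__ == "__main__":
    claims = [
        Proposition("metoprolol dose", 25.0, "mg"),   # matches the record
        Proposition("metoprolol dose", 50.0, "mg"),   # wrong dose
        Proposition("pneumonia"),                     # record says it is absent
        Proposition("sepsis"),                        # not in the record
    ]
    for c in claims:
        print(c.entity, c.value, "->", verify(c, EXAMPLE_EHR))
```

Because every check is a plain comparison, the same inputs always give the same verdict, which is the property the Figure 3 caption emphasizes. A real system would also need entity normalization and unit conversion before any of these comparisons are meaningful.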