VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

Philip Chung, Akshay Swaminathan, Alex J. Goodell, Yeasul Kim, S. Momsen Reincke, Lichy Han, Ben Deverett, Mohammad Amin Sadeghi, Abdel-Badih Ariss, Marc Ghanem, David Seong, Andrew A. Lee, Caitlin E. Coombes, Brad Bradshaw, Mahir A. Sufian, Hyo Jung Hong, Teresa P. Nguyen, Mohammad R. Rasouli, Komal Kamra, Mark A. Burbridge, James C. McAvoy, Roya Saffary, Stephen P. Ma, Dev Dash, James Xie, Ellen Y. Wang, Clifford A. Schmiesing, Nigam Shah, Nima Aghaeepour

TL;DR

VeriFact presents a general framework to guardrail long-form clinical text by verifying LLM-generated content against a patient’s electronic health record (EHR) using retrieval-augmented generation and an LLM-as-a-Judge. It introduces VeriFact-BHC, a large, openly available dataset of 13,290 propositions across 100 MIMIC-III cases, with clinician-grounded labels and adjudication. The study shows VeriFact achieving up to 92.7% agreement with a denoised ground truth, which is on par with or exceeds average clinician agreement and demonstrates the system’s potential to reliably evaluate LLM-generated clinical text. By leveraging open-source foundation models and a modular, scalable retrieval-and-reasoning design, VeriFact can serve as a practical guardrail for LLM-based EHR applications and beyond, while outlining clear avenues for improvement and future research.

Abstract

Methods to ensure factual accuracy of text generated by large language models (LLMs) in clinical medicine are lacking. VeriFact is an artificial intelligence system that combines retrieval-augmented generation and LLM-as-a-Judge to verify whether LLM-generated text is factually supported by a patient's medical history based on their electronic health record (EHR). To evaluate this system, we introduce VeriFact-BHC, a new dataset that decomposes Brief Hospital Course narratives from discharge summaries into a set of simple statements with clinician annotations for whether each statement is supported by the patient's EHR clinical notes. Whereas the highest agreement between clinicians was 88.5%, VeriFact achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth, suggesting that VeriFact exceeds the average clinician's ability to fact-check text against a patient's medical record. VeriFact may accelerate the development of LLM-based EHR applications by removing current evaluation bottlenecks.
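
As a minimal illustration of the agreement figures quoted above, the sketch below computes percent agreement between two sequences of labels. The function name and the toy labels are hypothetical and are not taken from the paper; the paper's denoising and adjudication steps are not modeled here.

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Percentage of positions where two raters assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Toy example with made-up labels (not data from the paper).
ground_truth = ["Supported", "Supported", "Not Supported", "Not Addressed"]
system = ["Supported", "Not Supported", "Not Supported", "Not Addressed"]
print(percent_agreement(ground_truth, system))  # 75.0
```

Here three of the four toy labels match, so the agreement is 75%.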

Paper Structure

This paper contains 77 sections, 11 figures, 18 tables.

Figures (11)

  • Figure 1: Schematic of the VeriFact system. VeriFact decomposes long-form input text such as a Brief Hospital Course narrative into a set of propositions for more detailed evaluation. Each patient’s EHR is also decomposed into a set of facts. For each proposition, VeriFact dynamically retrieves only the most relevant facts in the EHR to form a reference context specific to that proposition. Subsequently, each proposition and its corresponding reference context of facts are presented to an LLM-as-a-Judge tasked with determining whether each proposition is Supported, Not Supported, or Not Addressed by the facts in the EHR. Multiple foundation models are used in VeriFact: (1) “LLM” refers to a Llama 3.1 70B large language model that is used to decompose text into propositions and facts. It is also applied as a judge to read and compare each proposition with its reference context to perform a classification task. (2) “Embed” refers to the BGE-M3 bi-encoder language model that produces vector representations of propositions and EHR facts for vector search. Dense vector search uses semantic similarity, whereas sparse vector search uses token and lexical similarity to perform information retrieval. (3) “Rerank” refers to the BGE-M3 Reranker cross-encoder language model that can jointly attend to both proposition and fact and assign a more nuanced ranking score than bi-encoder language models. Experiments are conducted using only open-source foundation models to maximize transparency and reproducibility, but similar foundation models can be substituted.
  • Figure 2: (A) An LLM-written Brief Hospital Course narrative that summarizes the hospital admission, created by applying iterative rolling summarization across all of the patient’s EHR clinical notes from the admission. (B) VeriFact Evaluation Score Sheet output when verifying the LLM-written Brief Hospital Course with respect to the patient’s EHR clinical notes. The score sheet provides an overview of how much of the text is Supported, Not Supported, and Not Addressed, and a summary explanation for the scores. The summary explanations are generated by summarizing verdict explanations for each proposition in the label category. (C) Illustration of detailed proposition-level information presented to the LLM-as-a-Judge along with the assigned verdict and explanation. VeriFact jointly considers each proposition with the corresponding EHR Facts Reference Context to assign a verdict and generate a reason for the verdict. The Proposition column shows atomic claim propositions extracted from the LLM-written Brief Hospital Course summary in 2A. The EHR Facts Reference Context illustrates “Relative Time” formatting where the current time is set to the time of discharge (when the Brief Hospital Course narrative input text would be composed) and all other timestamps are converted into days and hours relative to the current time.
  • Figure 3: Top: Count and percent distribution of valid and invalid propositions for each of the author types (LLM-written vs. human-written Brief Hospital Course text) and proposition types (Atomic Claim vs. Sentence Proposition). Bottom: Breakdown of reasons for why propositions were invalid. Each proposition can have multiple reasons for being invalid. Sentence propositions from human-written text in particular exhibit many invalid propositions predominantly due to incomplete and vague statements. LLM-written text or atomic claim propositions which require an LLM to perform extraction rarely result in invalid propositions.
  • Figure 4: Top: Percent distribution of Supported, Not Supported, and Not Addressed propositions as assigned by human clinician ground truth (solid bars) and VeriFact (hatched bars). The VeriFact system whose label assignments are depicted uses the following hyperparameters: Retrieval Method = “Rerank”, Top N = 50, Reference Context Format = “Absolute Time”, Retrieve Facts Only From Current Admission = “No”. Other VeriFact systems with different hyperparameters will have different label distributions. Human-written text used in this study contains information that does not appear in the reference EHR notes, leading to a high number of propositions that cannot be Supported due to intrinsic information asymmetry between the text generation and evaluation process. In contrast, LLM-written text is generated using the same reference EHR notes as a knowledge source, leading to a high fraction of Supported propositions due to the symmetric information utilization in text generation and evaluation processes. Bottom: Sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for each verdict assignment of the VeriFact system using the human clinician ground truth as the gold standard. “Not Supported or Addressed” refers to the negative verdict label that arises when the task is binarized by combining Not Supported and Not Addressed. PPV for Not Addressed for LLM-written Text, Sentence Propositions is undefined because zero sentence propositions from LLM-written text are assigned Not Addressed verdicts by VeriFact.
  • Figure 5: Left: Relative importance of hyperparameters as measured by the Percent Agreement achieved between VeriFact and the human clinician ground truth labels. The plots show the effect of varying Top N (number of retrieved EHR facts), retrieval method, reference context format, and whether to limit retrieval scope to current admission. Each plot shows a sensitivity analysis of one of the four hyperparameters that is varied along the plot’s x-axis while the other three hyperparameters are fixed to a default value. The default values are: Top N = 10, Retrieval Method = “Dense”, Reference Context Format = “Relevance Score”, Retrieve Facts Only From Current Admission = “No”. Each line represents VeriFact performance on a specific combination of author type and proposition type. Area behind the line depicts 95% confidence intervals. Right: A depiction of how label assignment changes when transitioning from a weak VeriFact system (Top N=5) to a strong VeriFact system (Top N=50). “A” represents the label assigned by the weakest VeriFact system. The next stronger VeriFact system may assign the same label, or may assign a different label as denoted by “B” and “C”. The strongest VeriFact system is denoted in brackets and its label is compared against the Ground Truth and the resulting proposition counts are shown as bars. A majority of propositions are assigned the same label regardless of the amount of information retrieved from the EHR. However, a significant fraction of propositions have a change in label assignment as more facts are retrieved, and more often than not, this change in label assignment results in alignment with the human clinician ground truth label. In this analysis, propositions are pooled across author types and proposition types.
  • ...and 6 more figures
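
The decompose, retrieve, and judge flow described in the Figure 1 caption can be sketched as a toy pipeline. Everything here is a stand-in chosen for illustration and is not the authors' implementation: sentence splitting replaces LLM decomposition, Jaccard token overlap replaces BGE-M3 vector search and reranking, and `toy_judge` is a hypothetical placeholder for the LLM-as-a-Judge. All function names and example facts are invented.

```python
import re
from typing import Callable, List, Tuple


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def decompose(text: str) -> List[str]:
    """Stand-in for LLM decomposition: naive split after sentence punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def retrieve(proposition: str, facts: List[str], top_n: int) -> List[str]:
    """Stand-in for vector search and reranking: rank facts by Jaccard overlap."""
    query = tokens(proposition)

    def score(fact: str) -> float:
        fact_tokens = tokens(fact)
        union = query | fact_tokens
        return len(query & fact_tokens) / len(union) if union else 0.0

    return sorted(facts, key=score, reverse=True)[:top_n]


def verify(
    text: str,
    facts: List[str],
    judge: Callable[[str, List[str]], str],
    top_n: int = 2,
) -> List[Tuple[str, str]]:
    """Build a per-proposition reference context and ask the judge for a verdict."""
    results = []
    for proposition in decompose(text):
        context = retrieve(proposition, facts, top_n)
        results.append((proposition, judge(proposition, context)))
    return results


def toy_judge(proposition: str, context: List[str]) -> str:
    """Placeholder judge: exact-match lookup only, so it never returns 'Not Supported'."""
    return "Supported" if proposition in context else "Not Addressed"


# Made-up facts and summary text for illustration only.
ehr_facts = [
    "Patient was admitted with pneumonia.",
    "Patient received ceftriaxone.",
    "Patient has a history of asthma.",
]
summary = "Patient was admitted with pneumonia. Patient underwent surgery."
for proposition, verdict in verify(summary, ehr_facts, toy_judge):
    print(verdict, "|", proposition)
```

Passing the judge in as a callable mirrors the caption's remark that similar foundation models can be substituted: the retrieval and verdict steps stay decoupled.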
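
The “Relative Time” formatting mentioned in the Figure 2 caption, where timestamps are converted into days and hours relative to the time of discharge, might look like the following sketch. The function name and the exact output wording are assumptions; the paper's actual string format is not reproduced here.

```python
from datetime import datetime


def relative_time(timestamp: datetime, current_time: datetime) -> str:
    """Express a timestamp as whole days and hours relative to current_time.

    The output wording is an assumption made for illustration.
    """
    seconds = (current_time - timestamp).total_seconds()
    total_hours = int(abs(seconds) // 3600)  # partial hours are dropped
    if total_hours == 0:
        return "less than 1 hour from current time"
    days, hours = divmod(total_hours, 24)
    direction = "ago" if seconds > 0 else "from now"
    return f"{days} days, {hours} hours {direction}"


# Made-up timestamps: the current time is set to the time of discharge.
discharge = datetime(2020, 1, 10, 12, 0)
note_time = datetime(2020, 1, 7, 9, 30)
print(relative_time(note_time, discharge))  # 3 days, 2 hours ago
```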
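
The per-verdict metrics named in the Figure 4 caption can be computed one label at a time, treating that label as positive and all other labels as negative. The sketch below uses hypothetical names and made-up labels, and returns `None` when a denominator is zero, which is the situation the caption describes for an undefined PPV.

```python
from typing import Dict, Optional, Sequence


def one_vs_rest_metrics(
    truth: Sequence[str], predicted: Sequence[str], label: str
) -> Dict[str, Optional[float]]:
    """Sensitivity, specificity, PPV and NPV for one label against all others."""
    tp = fp = fn = tn = 0
    for t, p in zip(truth, predicted):
        if p == label and t == label:
            tp += 1
        elif p == label:
            fp += 1
        elif t == label:
            fn += 1
        else:
            tn += 1

    def ratio(numerator: int, denominator: int) -> Optional[float]:
        # None marks an undefined metric (zero denominator).
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Made-up labels: the system never predicts "Not Addressed".
truth = ["Supported", "Supported", "Not Supported", "Not Addressed"]
predicted = ["Supported", "Supported", "Not Supported", "Not Supported"]
print(one_vs_rest_metrics(truth, predicted, "Not Addressed"))
```

In this toy case PPV for “Not Addressed” is `None` because no proposition receives that verdict, so the PPV denominator (true positives plus false positives) is zero.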