VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records
Philip Chung, Akshay Swaminathan, Alex J. Goodell, Yeasul Kim, S. Momsen Reincke, Lichy Han, Ben Deverett, Mohammad Amin Sadeghi, Abdel-Badih Ariss, Marc Ghanem, David Seong, Andrew A. Lee, Caitlin E. Coombes, Brad Bradshaw, Mahir A. Sufian, Hyo Jung Hong, Teresa P. Nguyen, Mohammad R. Rasouli, Komal Kamra, Mark A. Burbridge, James C. McAvoy, Roya Saffary, Stephen P. Ma, Dev Dash, James Xie, Ellen Y. Wang, Clifford A. Schmiesing, Nigam Shah, Nima Aghaeepour
TL;DR
VeriFact is a general framework for guardrailing long-form clinical text: it verifies LLM-generated content against a patient’s electronic health record (EHR) using retrieval-augmented generation and an LLM-as-a-Judge. The work introduces VeriFact-BHC, a large, openly available dataset of 13,290 propositions across 100 MIMIC-III cases, with clinician-grounded labels and adjudication. VeriFact achieves up to 92.7% agreement with the denoised ground truth, on par with or exceeding average clinician agreement, which demonstrates its potential to reliably evaluate LLM-generated clinical text. Because it builds on open-source foundation models and a modular, scalable retrieval-and-reasoning design, VeriFact can serve as a practical guardrail for LLM-based EHR applications and beyond, and the study outlines avenues for improvement and future research.
Abstract
Methods to ensure factual accuracy of text generated by large language models (LLMs) in clinical medicine are lacking. VeriFact is an artificial intelligence system that combines retrieval-augmented generation and LLM-as-a-Judge to verify whether LLM-generated text is factually supported by a patient's medical history based on their electronic health record (EHR). To evaluate this system, we introduce VeriFact-BHC, a new dataset that decomposes Brief Hospital Course narratives from discharge summaries into a set of simple statements with clinician annotations for whether each statement is supported by the patient's EHR clinical notes. Whereas the highest agreement between clinicians was 88.5%, VeriFact achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth, suggesting that VeriFact exceeds the average clinician's ability to fact-check text against a patient's medical record. VeriFact may accelerate the development of LLM-based EHR applications by removing current evaluation bottlenecks.
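To make the evaluation loop described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy word-overlap retriever in place of real retrieval over EHR notes, a stub judge in place of an LLM-as-a-Judge call, and hypothetical label strings (`"supported"`, `"not_supported"`). All function names are illustrative. The last function computes the percent agreement between system labels and a reference labeling.

```python
from typing import Callable, List


def retrieve(proposition: str, ehr_facts: List[str], top_k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank EHR facts by word overlap with the proposition."""
    words = set(proposition.lower().split())
    ranked = sorted(
        ehr_facts,
        key=lambda fact: len(words & set(fact.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def stub_judge(proposition: str, evidence: List[str]) -> str:
    """Stand-in for an LLM-as-a-Judge call (assumption): exact-match check only."""
    normalized = {e.lower() for e in evidence}
    return "supported" if proposition.lower() in normalized else "not_supported"


def verify(
    propositions: List[str],
    ehr_facts: List[str],
    judge: Callable[[str, List[str]], str],
) -> List[str]:
    """Label each proposition using the evidence retrieved for it."""
    return [judge(p, retrieve(p, ehr_facts)) for p in propositions]


def percent_agreement(predicted: List[str], reference: List[str]) -> float:
    """Percentage of propositions on which two label lists agree."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("label lists must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(predicted)


# Hypothetical example data, not drawn from VeriFact-BHC.
ehr = ["Patient was started on metoprolol.", "Patient has a history of diabetes."]
props = ["Patient was started on metoprolol.", "Patient underwent appendectomy."]
labels = verify(props, ehr, stub_judge)
```

In a real system the stub judge would be replaced by a model call and the retriever by a search over the patient's clinical notes; the agreement function stays the same regardless of how the labels are produced.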
