Structured Extraction of Real World Medical Knowledge using LLMs for Summarization and Search
Edward Kim, Manil Shrestha, Richard Foty, Tom DeLay, Vicki Seyfert-Margolis
TL;DR
The paper addresses the challenge of extracting and organizing real-world medical knowledge from heterogeneous data sources by proposing patient knowledge graphs built with LLM-based entity extraction and grounded in established ontologies (MeSH, SNOMED-CT, RxNORM, HPO). It demonstrates the approach on a large ambulatory care EHR database (33.6M patients): confirmed ICD-10 codes for Dravet syndrome serve as ground truth for building patient-specific graphs and running symptom-based patient searches, and the same method is then applied to identify BPAN patients, for whom no ground truth exists. The work places LLMs in the evolution of biomedical NLP from rule-based and transformer models, noting that they enable zero-shot and few-shot extraction, and it emphasizes ontology grounding to improve interoperability and search. Overall, the framework supports natural-language data extraction and ontology-grounded search for scalable real-world disease discovery, including rare diseases, with potential to accelerate clinical research design and real-world evidence synthesis.
Abstract
Creation and curation of knowledge graphs can accelerate disease discovery and analysis in real-world data. While disease ontologies aid in biological data annotation, codified categories (SNOMED-CT, ICD-10, CPT) may not capture the nuances of a patient's condition or rare diseases. Multiple disease definitions across data sources complicate ontology mapping and disease clustering. We propose creating patient knowledge graphs using large language model extraction techniques, allowing data extraction via natural language rather than rigid ontological hierarchies. Our method maps to existing ontologies (MeSH, SNOMED-CT, RxNORM, HPO) to ground extracted entities. Using a large ambulatory care EHR database with 33.6M patients, we demonstrate our method through a patient search for Dravet syndrome, which received ICD-10 recognition in October 2020. We describe our construction of patient-specific knowledge graphs and symptom-based patient searches. Using confirmed Dravet syndrome ICD-10 codes as ground truth, we employ LLM-based entity extraction to characterize patients in grounded ontologies. We then apply this method to identify Beta-propeller protein-associated neurodegeneration (BPAN) patients, demonstrating real-world discovery where no ground truth exists.
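The pipeline the abstract describes (extract entities, ground them to an ontology, build a per-patient graph, search by symptom) can be illustrated with a minimal sketch. This is not the authors' implementation: the LLM extraction step is replaced by hand-written (relation, mention) pairs, the lookup table `ONTOLOGY` is a hypothetical stand-in for a real ontology service, the concept IDs are placeholders rather than real HPO or RxNORM identifiers, and the function names `ground`, `build_graph`, and `search` are invented for illustration.

```python
# Hypothetical synonym table standing in for a real ontology lookup.
# The "EX:" IDs are placeholders, NOT real HPO or RxNORM identifiers.
ONTOLOGY = {
    "seizure": ("HPO", "EX:0001"),
    "seizures": ("HPO", "EX:0001"),
    "febrile seizure": ("HPO", "EX:0002"),
    "developmental delay": ("HPO", "EX:0003"),
    "clobazam": ("RxNORM", "EX:1001"),
}


def ground(mention):
    """Map a free-text mention to an (ontology, concept_id) pair, or None."""
    return ONTOLOGY.get(mention.strip().lower())


def build_graph(patient_id, extracted):
    """Turn (relation, mention) pairs into grounded triples for one patient.

    `extracted` stands in for the output of an LLM extraction step;
    mentions that cannot be grounded are dropped.
    """
    triples = set()
    for relation, mention in extracted:
        concept = ground(mention)
        if concept is not None:
            triples.add((patient_id, relation, concept))
    return triples


def search(graphs, required):
    """Return sorted patient IDs whose graph contains every required concept."""
    required = set(required)
    hits = []
    for pid, triples in graphs.items():
        concepts = {concept for _, _, concept in triples}
        if required <= concepts:
            hits.append(pid)
    return sorted(hits)


# Toy data: two patients with hand-written "extracted" entities.
graphs = {
    "p1": build_graph("p1", [
        ("has_symptom", "Febrile seizure"),
        ("has_symptom", "developmental delay"),
        ("takes", "Clobazam"),
    ]),
    "p2": build_graph("p2", [
        ("has_symptom", "Seizures"),
        ("has_symptom", "headache"),  # not in the table, so it is dropped
    ]),
}

# Symptom-based search over grounded concepts rather than raw text.
matches = search(graphs, [("HPO", "EX:0002"), ("HPO", "EX:0003")])
print(matches)
```

Because the search runs over grounded concept IDs, surface variants such as "Seizures" and "seizure" resolve to the same node, which is the interoperability benefit the abstract attributes to ontology grounding.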
