Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation
Angelo Ziletti, Leonardo D'Ambrosi
TL;DR
The paper automates the generation of clinical cohorts from EHR data by translating observational study criteria into executable SQL queries with a two-step retrieval-augmented generation framework. It introduces two complementary knowledge bases, EpiAskKB for analytical questions and EpiCohoKB for inclusion/exclusion criteria, and combines criterion-level and cohort-level retrieval-augmented prompts with medical concept standardization and placeholder-based SQL generation. A self-healing loop handles errors during end-to-end generation, and a transparent patient funnel provides interpretability. On Optum OMOP-CDM data, the system achieves a cohort-identification F1 of 0.75 and handles temporal and logical relationships robustly, with an open-source release and Bayer deployment in progress.
Abstract
Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains a challenging, manual task. We present an automated system built on large language models that combines criteria parsing, two-level retrieval-augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts together with their patient funnels. The system achieves an F1-score of 0.75 in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
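
To make the ideas of placeholder-based SQL generation, a patient funnel, and a cohort-level F1 more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy in-memory SQLite database with two OMOP-style tables (`person`, `condition_occurrence`), a hypothetical `{{NAME}}` placeholder syntax, and made-up concept IDs (1001, 1002); the function names `fill_placeholders`, `patient_funnel`, and `cohort_f1` are illustrative only.

```python
import sqlite3

# Hypothetical mapping from placeholders to standardized concept IDs.
# The IDs are toy values for illustration, not real vocabulary codes.
CONCEPTS = {"{{DIABETES}}": [1001], "{{EXCLUDED_CONDITION}}": [1002]}


def build_toy_db():
    """Create a tiny in-memory database with OMOP-style table names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER);"
        "CREATE TABLE condition_occurrence "
        "(person_id INTEGER, condition_concept_id INTEGER);"
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(1, 1950), (2, 1960), (3, 1990), (4, 1955), (5, 2010)],
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?)",
        [(1, 1001), (2, 1001), (3, 1001), (4, 1002), (2, 1002)],
    )
    return conn


def fill_placeholders(sql_template, concept_map):
    """Replace each placeholder with a comma-separated list of concept IDs."""
    sql = sql_template
    for placeholder, ids in concept_map.items():
        sql = sql.replace(placeholder, ", ".join(str(i) for i in ids))
    return sql


def patient_funnel(conn, criteria):
    """Apply criteria in order and record how many patients remain after each.

    Each criterion is (label, sql, include): inclusion keeps matching
    patients, exclusion removes them.
    """
    remaining = {row[0] for row in conn.execute("SELECT person_id FROM person")}
    funnel = [("all patients", len(remaining))]
    for label, sql, include in criteria:
        matched = {row[0] for row in conn.execute(sql)}
        remaining = remaining & matched if include else remaining - matched
        funnel.append((label, len(remaining)))
    return funnel, remaining


def cohort_f1(predicted, reference):
    """F1 between a predicted and a reference set of patient IDs."""
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)


CONDITION_SQL = (
    "SELECT DISTINCT person_id FROM condition_occurrence "
    "WHERE condition_concept_id IN ({})"
)
CRITERIA = [
    ("has diabetes", fill_placeholders(CONDITION_SQL.format("{{DIABETES}}"), CONCEPTS), True),
    ("born 1980 or earlier", "SELECT person_id FROM person WHERE year_of_birth <= 1980", True),
    ("no excluded condition", fill_placeholders(CONDITION_SQL.format("{{EXCLUDED_CONDITION}}"), CONCEPTS), False),
]

conn = build_toy_db()
funnel, cohort = patient_funnel(conn, CRITERIA)
for label, count in funnel:
    print(f"{label}: {count}")
```

Keeping concept IDs out of the SQL template until a separate substitution step means the query logic and the medical vocabulary lookup can be checked independently, and the per-criterion counts show where patients drop out of the cohort.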
