Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation

Angelo Ziletti; Leonardo D'Ambrosi

TL;DR

The paper tackles automated generation of clinical cohorts from EHR data by translating observational study criteria into executable SQL queries via a two-step retrieval-augmented generation framework. It introduces two complementary knowledge bases, EpiAskKB for analytical questions and EpiCohoKB for inclusion/exclusion criteria, and combines criterion-level and cohort-level retrieval-augmented prompts with medical concept standardization and placeholder-based SQL generation. A self-healing loop and a transparent patient funnel provide error handling and interpretability during end-to-end generation. On Optum OMOP-CDM data, the system achieves a cohort-identification F1 of 0.75 and demonstrates robust handling of temporal and logical relationships; an open-source release and deployment at Bayer are in progress.
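To make the idea of placeholder-based SQL generation with concept standardization concrete, here is a minimal sketch in which the SQL text carries a named placeholder for each medical concept and a separate step substitutes standardized concept IDs. The placeholder syntax `[concept@...]`, the helper `fill_placeholders`, and the concept IDs are all hypothetical and chosen for illustration; they are not taken from the paper. Only the OMOP-CDM table and column names (`condition_occurrence`, `condition_concept_id`, `person_id`) are standard.

```python
import re


def fill_placeholders(sql_template: str, concept_map: dict) -> str:
    """Replace each [concept@name] placeholder with a comma-separated ID list.

    The placeholder syntax is an assumption made for this sketch.
    """
    def substitute(match):
        name = match.group(1)
        if name not in concept_map:
            # Fail loudly instead of emitting SQL with an unresolved concept.
            raise KeyError(f"no standardized concept IDs for placeholder '{name}'")
        # Cast to int so that only numeric IDs can reach the SQL text.
        return ", ".join(str(int(cid)) for cid in concept_map[name])

    return re.sub(r"\[concept@([^\]]+)\]", substitute, sql_template)


# SQL as a model might emit it: structure fixed, concept left symbolic.
template = (
    "SELECT DISTINCT person_id FROM condition_occurrence "
    "WHERE condition_concept_id IN ([concept@type 2 diabetes])"
)

# Made-up IDs standing in for the output of a concept standardization step.
sql = fill_placeholders(template, {"type 2 diabetes": [111, 222]})
print(sql)
```

Separating query structure from concept codes in this way means the generated SQL can be checked for logic independently of the vocabulary lookup, which is one plausible motivation for a placeholder design.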

Abstract

Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains a challenging, manual task. We present an automated system based on large language models that combines criteria parsing, two-level retrieval-augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts together with their patient funnels. The system achieves an F1-score of 0.75 in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
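The two quantities mentioned above, a patient funnel and a cohort-identification F1-score, can be illustrated with a short sketch. It assumes that each criterion resolves to a set of matching patient IDs and that F1 is computed over patient sets (generated cohort versus reference cohort); both are assumptions of this example rather than details confirmed by the abstract. The function names and toy patient IDs are hypothetical.

```python
def patient_funnel(start, steps):
    """Apply criteria in order, recording how many patients remain after each.

    `steps` is a list of (label, kind, matching_ids), where kind is
    "include" or "exclude".
    """
    remaining = set(start)
    funnel = [("initial population", len(remaining))]
    for label, kind, matching in steps:
        if kind == "include":
            remaining &= set(matching)   # keep only patients meeting the criterion
        else:
            remaining -= set(matching)   # drop patients meeting the exclusion
        funnel.append((label, len(remaining)))
    return remaining, funnel


def cohort_f1(predicted, reference):
    """F1 between a generated cohort and a reference cohort (sets of patient IDs)."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy data: six patients, one inclusion and one exclusion criterion.
cohort, funnel = patient_funnel(
    {1, 2, 3, 4, 5, 6},
    [
        ("age >= 18", "include", {1, 2, 3, 4, 5}),
        ("prior stroke", "exclude", {2, 5}),
    ],
)
f1 = cohort_f1(cohort, reference={1, 3, 6})
```

In this toy case the funnel goes from 6 to 5 to 3 patients, and the generated cohort {1, 3, 4} shares two patients with the reference cohort {1, 3, 6}, so precision and recall are both 2/3.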
