Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation

Angelo Ziletti, Leonardo D'Ambrosi

TL;DR

The paper tackles automated generation of clinical cohorts from EHR data by translating observational study criteria into executable SQL queries with a two-step retrieval-augmented generation framework. It introduces two complementary knowledge bases, EpiAskKB for analytical questions and EpiCohoKB for inclusion/exclusion criteria, and combines criterion-level and cohort-level retrieval-augmented prompts with medical concept standardization and placeholder-based SQL generation. A self-healing loop handles errors during end-to-end generation, and a transparent patient funnel makes the result interpretable. On Optum OMOP-CDM data, the system achieves a cohort-identification F1 of 0.75 and handles temporal and logical relationships between criteria; an open-source release and a deployment at Bayer are in progress.
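As a rough illustration of what placeholder-based SQL generation combined with concept standardization could look like, the sketch below fills named placeholders in a SQL template with standardized concept IDs. The `{{concept:...}}` placeholder syntax, the `resolve_placeholders` helper, and the concept IDs are all assumptions made for this example and are not taken from the paper; only the OMOP-CDM table and column names (`condition_occurrence`, `condition_concept_id`, `person_id`) are real.

```python
import re

# Hypothetical lookup from a normalized medical term to standardized concept IDs.
# The IDs are made up for illustration; a real system would query a vocabulary.
CONCEPT_IDS = {
    "type 2 diabetes mellitus": [1001, 1002],
}

# Hypothetical placeholder syntax: {{concept:<free-text term>}}
PLACEHOLDER = re.compile(r"\{\{concept:([^}]+)\}\}")


def resolve_placeholders(sql_template, concept_ids):
    """Replace every concept placeholder with a comma-separated list of IDs."""

    def substitute(match):
        term = match.group(1).strip().lower()
        if term not in concept_ids:
            # Fail loudly instead of emitting SQL with an unresolved concept.
            raise KeyError(f"no standardized concept for {term!r}")
        return ", ".join(str(concept_id) for concept_id in concept_ids[term])

    return PLACEHOLDER.sub(substitute, sql_template)


template = (
    "SELECT DISTINCT person_id FROM condition_occurrence "
    "WHERE condition_concept_id IN ({{concept:Type 2 diabetes mellitus}})"
)
resolved = resolve_placeholders(template, CONCEPT_IDS)
print(resolved)
```

Keeping concept resolution separate from SQL generation means the language model only has to name the medical term, while the mapping to codes stays deterministic and auditable.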

Abstract

Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains challenging and manual. We present an automated system utilizing large language models that combines criteria parsing, two-level retrieval augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts with patient funnels. The system achieves 0.75 F1-score in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
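A patient funnel of the kind mentioned above can be pictured as applying criteria one at a time and recording how many patients remain after each step. The following is a minimal sketch under assumed inputs: the `patient_funnel` function, the criterion labels, and the toy patient IDs are hypothetical, and in the described system the matching sets would come from executing generated SQL rather than from hard-coded Python sets.

```python
def patient_funnel(base_population, criteria):
    """Apply criteria in order and record the remaining patient count per step.

    criteria is a list of (label, kind, matching_patients) tuples, where kind
    is "include" (keep only matching patients) or "exclude" (drop them).
    """
    remaining = set(base_population)
    funnel = [("base population", len(remaining))]
    for label, kind, matching in criteria:
        if kind == "include":
            remaining &= set(matching)
        elif kind == "exclude":
            remaining -= set(matching)
        else:
            raise ValueError(f"unknown criterion kind: {kind!r}")
        funnel.append((label, len(remaining)))
    return remaining, funnel


# Toy data: patient IDs matching each (hypothetical) criterion.
base = {1, 2, 3, 4, 5, 6}
criteria = [
    ("diagnosis of condition X", "include", {1, 2, 3, 4}),
    ("age >= 18 at index", "include", {2, 3, 4, 5}),
    ("prior exposure to drug Y", "exclude", {4}),
]
cohort, funnel = patient_funnel(base, criteria)
for label, count in funnel:
    print(f"{label}: {count}")
```

With this toy data the counts go 6, 4, 3, 2, which shows at a glance which criterion removes the most patients.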

Paper Structure

This paper contains 11 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: (a-g) From inclusion/exclusion criteria in natural language to patient cohorts using electronic health record databases: end-to-end workflow. (h) Query decomposition and patient funnel generation through LLM-based processing.
  • Figure 2: Performance evaluation of text-to-SQL generation for patient cohort identification. "Valid SQL" indicates syntactically correct queries; "Retrieved" indicates queries that successfully retrieved patient data. Patient-level metrics evaluate cohort membership accuracy, while date-level metrics assess the temporal alignment of cohort index dates. Higher values are better; all values are percentages. Bold indicates the best result and underlining the second best.
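
The patient-level and date-level metrics named in the Figure 2 caption can be sketched as follows. Patient-level precision, recall, and F1 are computed on sets of patient IDs in the standard way. The date-level function is only one plausible reading (the share of patients found in both cohorts whose index dates match exactly) and is an assumption, since the exact definition is not given here; function names and toy data are hypothetical.

```python
from datetime import date


def patient_level_scores(predicted, reference):
    """Precision, recall, and F1 of predicted cohort membership."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def index_date_agreement(predicted_dates, reference_dates):
    """Assumed date-level metric: exact index-date matches among shared patients."""
    shared = predicted_dates.keys() & reference_dates.keys()
    if not shared:
        return 0.0
    matches = sum(predicted_dates[p] == reference_dates[p] for p in shared)
    return matches / len(shared)


# Toy cohorts mapping patient ID to index date.
predicted = {2: date(2020, 1, 1), 3: date(2020, 2, 1), 4: date(2020, 3, 1), 1: date(2020, 4, 1)}
reference = {2: date(2020, 1, 1), 3: date(2020, 2, 15), 4: date(2020, 3, 1),
             5: date(2020, 5, 1), 6: date(2020, 6, 1)}

precision, recall, f1 = patient_level_scores(predicted, reference)
agreement = index_date_agreement(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} date agreement={agreement:.2f}")
```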