EMR-AGENT: Automating Cohort and Feature Extraction from EMR Databases

Kwanhyung Lee, Sungsoo Hong, Joonhyung Park, Jeonghyeop Lim, Juhwan Choi, Donghwee Yoon, Eunho Yang

TL;DR

The paper addresses the reproducibility and scalability barrier that heterogeneous EMR schemas create for clinical ML by introducing EMR-AGENT, an agent-based preprocessing framework. It deploys two LLM-driven agents: CFSA for cohort and feature selection, and CMA for code mapping. Both iteratively observe database outputs and reason over schema and documentation to automate extraction. A dedicated benchmark suite, PreCISE-EMR, covering MIMIC-III, eICU, and SICdb, is used to demonstrate strong cross-database generalization and the value of external knowledge for schema guidance. The work provides evidence that automated, rule-free EMR preprocessing can approach human expert performance, and the authors offer public code to foster reproducibility and broader adoption.

Abstract

Machine learning models for clinical prediction rely on structured data extracted from Electronic Medical Records (EMRs), yet this process remains dominated by hardcoded, database-specific pipelines for cohort definition, feature selection, and code mapping. These manual efforts limit scalability, reproducibility, and cross-institutional generalization. To address this, we introduce EMR-AGENT (Automated Generalized Extraction and Navigation Tool), an agent-based framework that replaces manual rule writing with dynamic, language model-driven interaction to extract and standardize structured clinical data. Our framework automates cohort selection, feature extraction, and code mapping through interactive querying of databases. Our modular agents iteratively observe query results and reason over schema and documentation, using SQL not just for data retrieval but also as a tool for database observation and decision making. This eliminates the need for hand-crafted, schema-specific logic. To enable rigorous evaluation, we develop a benchmarking codebase for three EMR databases (MIMIC-III, eICU, SICdb), including both seen and unseen schema settings. Our results demonstrate strong performance and generalization across these databases, highlighting the feasibility of automating a process previously thought to require expert-driven design. The code will be released publicly at https://github.com/AITRICS/EMR-AGENT/tree/main. For a demonstration, please visit our anonymous demo page: https://anonymoususer-max600.github.io/EMR_AGENT/
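The idea of using SQL "as a tool for database observation" can be illustrated with a minimal sketch. It is not the authors' implementation: the toy schema, the function names (`observe_schema`, `observe_values`, `choose_column`, `extract_feature`), and the keyword-matching rule that stands in for the LLM reasoning step are all assumptions made for illustration, and SQLite replaces a real EMR database.

```python
import sqlite3


def build_demo_db():
    """Create a toy in-memory database (hypothetical schema, not a real EMR)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patients (pid INTEGER, age INTEGER);
        CREATE TABLE events (pid INTEGER, label TEXT, value REAL);
        INSERT INTO patients VALUES (1, 70), (2, 15), (3, 45);
        INSERT INTO events VALUES
            (1, 'Heart Rate', 88.0), (2, 'Heart Rate', 101.0),
            (3, 'Resp Rate', 18.0), (3, 'Heart Rate', 64.0);
    """)
    return conn


def observe_schema(conn):
    """Observation step 1: read table and column names from the catalog."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {t: [col[1] for col in conn.execute(f"PRAGMA table_info({t})")]
            for t in tables}


def observe_values(conn, table, column, limit=5):
    """Observation step 2: sample distinct values from one column."""
    rows = conn.execute(
        f"SELECT DISTINCT {column} FROM {table} ORDER BY {column} LIMIT ?",
        (limit,))
    return [row[0] for row in rows]


def choose_column(observations, keyword):
    """Stand-in for the LLM: pick the first column whose samples mention keyword."""
    for (table, column), values in observations.items():
        if any(isinstance(v, str) and keyword in v.lower() for v in values):
            return table, column
    return None


def extract_feature(conn, keyword, min_age):
    """Observe, decide, then issue the final extraction query."""
    schema = observe_schema(conn)
    observations = {(t, c): observe_values(conn, t, c)
                    for t, cols in schema.items() for c in cols}
    hit = choose_column(observations, keyword)
    if hit is None:
        return []
    table, column = hit
    # The cohort filter (age) and join keys are hard-coded for this toy schema.
    sql = (f"SELECT e.pid, e.value FROM {table} AS e "
           f"JOIN patients AS p ON p.pid = e.pid "
           f"WHERE lower(e.{column}) LIKE ? AND p.age >= ? ORDER BY e.pid")
    return conn.execute(sql, (f"%{keyword}%", min_age)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(extract_feature(conn, "heart rate", 18))  # [(1, 88.0), (3, 64.0)]
```

In this sketch the observation queries never return the final data; they only inform which column the extraction query should target. A real agent would presumably replace `choose_column` with a language-model call and repeat the observe-and-decide loop until it is confident in the mapping.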

Paper Structure

This paper contains 63 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Illustration of the shift from (A) the conventional Rule-Based Manual Pipeline, where clinical experts must handcraft cohort and feature extraction logic as well as mapping codes for each database, to (B) our EMR-AGENT (Agent-Based Extraction Framework), which automates these processes through iterative interaction with the database, enabling generalization to diverse schemas.
  • Figure 2: Illustration of the two main components of EMR-AGENT: (a) CFSA dynamically selects cohorts and features from diverse EMR databases, reducing manual intervention; (b) CMA harmonizes database-specific codes for uniform feature representation.
  • Figure 3: Comparison of Observation-SQL Number and F1 Score across EMR databases.
  • Figure A.1: A flowchart for comparison of the MIMIC-III benchmark as a reliability evaluation.
  • Figure A.2: A flowchart for comparison of the eICU benchmark as a reliability evaluation.
  • ...and 3 more figures