ACES: Automatic Cohort Extraction System for Event-Stream Datasets

Justin Xu, Jack Gallifant, Alistair E. W. Johnson, Matthew B. A. McDermott

TL;DR

The paper addresses the reproducibility crisis in ML for healthcare driven by private EHR data and inconsistent cohort definitions. It introduces ACES, an Automatic Cohort Extraction System that uses a domain-specific language and an event-stream representation to define and extract task-specific cohorts across diverse datasets. Key contributions include a recursive task-configuration algorithm, dataset-agnostic task definitions, a CLI and Python API, and a repository of example configurations with MEDS/ESGPT compatibility, plus scalability via data sharding. The work aims to lower barriers to sharing task definitions, enable conceptual reproducibility across datasets, and enable new cross-dataset benchmarks in healthcare ML, accelerating robust method development.

Abstract

Reproducibility remains a significant challenge in machine learning (ML) for healthcare. Datasets, model pipelines, and even task or cohort definitions are often private in this field, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. We address a significant part of this problem by introducing the Automatic Cohort Extraction System (ACES) for event-stream data. This library is designed to simultaneously simplify the development of tasks and cohorts for ML in healthcare and also enable their reproduction, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides: (1) a highly intuitive and expressive domain-specific configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion or exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or Event Stream GPT (ESGPT) formats, or to *any* dataset in which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks in representation learning, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies using this modality. ACES is available at: https://github.com/justin13601/aces.

Paper Structure (23 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Workflow for extracting cohorts using ACES. The pipeline shows the expected format for ACES-supported event-stream datasets and outcome cohorts. The transformation of raw data into the event-stream format is intentionally straightforward (primarily merging relational database tables), which minimizes the data-loss risks associated with other common data models (CDMs) such as OMOP.
  • Figure 2: ML task cohort extraction process (A) with and (B) without ACES. Predicates are dataset-specific concepts that are needed to conceptually capture an ML task. Windows are temporal segments of a patient's health record and are dataset-agnostic, as they are defined relative to the predicates. This distinction allows researchers to easily share the more complex task logic, which is independent of any particular dataset, facilitating conceptual reproducibility for ML tasks in healthcare.
  • Figure 3: Overview of the ACES recursive algorithm. Given a task tree generated from a configuration file, ACES first identifies possible roots of the tree (task triggers) based on the associated predicate. It then computes aggregations of predicate counts over time-based (i.e., windows with a time interval) or event-based (i.e., windows between specified events) periods to summarize predicates over the edges between the tree nodes. Finally, invalid branches are filtered out if their predicate counts do not meet the specified criteria. The same steps are then applied recursively to every child node of the task tree.
  • Figure 4: Example configuration file for the binary prediction of in-hospital mortality 48 hours after admission. References to predicates and windows are italicized and bolded, respectively. (A) Dataset-specific task predicates. These concepts are needed to conceptually capture this task and are used as constraints and boundaries for windows of the patient record. For instance, in this example, the value of "$ADMISSION$" denotes a hospital admission event in the source dataset. (B) A window of the task specifying the task inputs for downstream models. Suppose we'd like to use all historic patient data up to and including 24 hours past the admission. We could also place an arbitrary criterion requiring more than 5 records in this window to ensure that the extracted cohort contains sufficient input data. (C) Trigger events for the task, which are hospital admissions as we'd like to make a mortality prediction for each admission. (D) A window of the task specifying a gap in the patient timeline. Suppose we'd like to set a minimum length of admission for our cohort (e.g., 48 hours). A temporal constraint (minimum window duration) of 48 hours could then be set to represent this requirement. (E) A window of the task specifying the task target, which is set from the end of (D) to the immediately subsequent $discharge$ or $death$ predicate. This creates our binary label classes for the task (i.e., $discharge=0$; $death=1$). All windows are interrelated on the patient timeline, as shown by how each window references another in the configuration file.
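The task described in the Figure 4 caption can be made concrete with a small, self-contained sketch. This is not the ACES configuration language or its API: the `TASK` dictionary, its keys, the event codes, and the `extract` function are hypothetical names chosen for illustration. The sketch also assumes one reading of panel (D), namely that a minimum 48-hour stay means no discharge or death occurs inside the gap window. It walks a single patient's time-sorted event stream and applies panels (B) through (E) in turn.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified task description mirroring panels (A)-(E) of Figure 4.
# The keys and codes are illustrative assumptions, not the ACES schema.
TASK = {
    "predicates": {"admission": "ADMISSION", "discharge": "DISCHARGE", "death": "DEATH"},
    "trigger": "admission",
    "input": {"end_offset": timedelta(hours=24), "min_events": 6},  # "more than 5 records"
    "gap": {"duration": timedelta(hours=48)},
    "target": {"ends_at": ("discharge", "death"), "label": "death"},
}


def extract(events, task=TASK):
    """Return one (trigger_time, label) row per valid trigger.

    `events` is one patient's time-sorted list of (timestamp, code) tuples.
    """
    codes = task["predicates"]
    end_codes = {codes[name] for name in task["target"]["ends_at"]}
    rows = []
    for t0, code in events:
        # (C) Only trigger events (admissions) start a candidate sample.
        if code != codes[task["trigger"]]:
            continue
        # (B) Input window: all history up to and including trigger + 24h.
        input_end = t0 + task["input"]["end_offset"]
        n_input = sum(1 for t, _ in events if t <= input_end)
        if n_input < task["input"]["min_events"]:
            continue
        # (D) Gap window: assumed to forbid discharge or death within 48h.
        gap_end = t0 + task["gap"]["duration"]
        if any(t0 < t <= gap_end and c in end_codes for t, c in events):
            continue
        # (E) Target window: from the end of the gap to the next discharge or death.
        outcome = next((c for t, c in events if t > gap_end and c in end_codes), None)
        if outcome is None:
            continue  # no recorded end of admission, so no label can be assigned
        rows.append((t0, int(outcome == codes[task["target"]["label"]])))
    return rows


# Toy event stream: an admission, five lab events, and a death 72 hours later.
start = datetime(2024, 1, 1)
stream = [(start, "ADMISSION")]
stream += [(start + timedelta(hours=h), "LAB") for h in range(1, 6)]
stream += [(start + timedelta(hours=72), "DEATH")]

print(extract(stream))
```

For this toy stream the admission passes every check and receives label 1 (death). Moving the terminal event inside the 48-hour gap, or dropping one of the lab events, removes the admission from the cohort. A real extraction would also need to handle multiple patients and overlapping admissions, which this sketch ignores.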