ACES: Automatic Cohort Extraction System for Event-Stream Datasets
Justin Xu, Jack Gallifant, Alistair E. W. Johnson, Matthew B. A. McDermott
TL;DR
The paper addresses the reproducibility crisis in ML for healthcare, which is driven by private EHR data and inconsistent cohort definitions. It introduces ACES, an Automatic Cohort Extraction System that uses a domain-specific language and an event-stream representation to define and extract task-specific cohorts across diverse datasets. Key contributions include a recursive task-configuration algorithm, dataset-agnostic task definitions, a CLI and Python API, a repository of example configurations, compatibility with the MEDS and ESGPT formats, and scalability via data sharding. The work aims to lower barriers to sharing task definitions, enable conceptual reproducibility across datasets, and support new cross-dataset benchmarks in healthcare ML, accelerating robust method development.
Abstract
Reproducibility remains a significant challenge in machine learning (ML) for healthcare. Datasets, model pipelines, and even task or cohort definitions are often private in this field, creating a substantial barrier to sharing, iterating on, and understanding ML results on electronic health record (EHR) datasets. We address a major part of this problem by introducing the Automatic Cohort Extraction System (ACES) for event-stream data. This library is designed both to simplify the development of tasks and cohorts for ML in healthcare and to enable their reproduction, at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides: (1) an intuitive and expressive domain-specific configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion or exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or Event Stream GPT (ESGPT) formats, or to *any* dataset in which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks in representation learning, redefine the way researchers interact with EHR datasets, and improve the state of reproducibility for ML studies using this modality. ACES is available at: https://github.com/justin13601/aces.
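To make the split between dataset-specific concepts and dataset-agnostic criteria concrete, the following is a minimal illustrative sketch in plain Python. It does not use the ACES library or its configuration syntax; the event codes, the `PREDICATES` and `TASK` structures, and the `extract_cohort` function are all hypothetical names chosen for illustration. It assumes an event stream of (subject, timestamp, code) rows and a task that keeps each admission followed by at least one lab event and no death within 24 hours.

```python
from datetime import datetime, timedelta

# Hypothetical event stream: (subject_id, timestamp, raw dataset code).
EVENTS = [
    (1, datetime(2020, 1, 1, 8), "ADMIT"),
    (1, datetime(2020, 1, 1, 12), "LAB_K"),
    (1, datetime(2020, 1, 3, 9), "DISCHARGE"),
    (2, datetime(2020, 2, 1, 10), "ADMIT"),
    (2, datetime(2020, 2, 1, 20), "DEATH"),
    (3, datetime(2020, 3, 1, 7), "ADMIT"),
]

# Dataset-specific part (assumed): which raw codes satisfy each predicate.
PREDICATES = {
    "admission": {"ADMIT"},
    "death": {"DEATH"},
    "lab": {"LAB_K", "LAB_NA"},
}

# Dataset-agnostic part (assumed): criteria expressed only over predicate names.
TASK = {
    "trigger": "admission",
    "window": timedelta(hours=24),
    "min_counts": {"lab": 1},    # inclusion: at least one lab in the window
    "max_counts": {"death": 0},  # exclusion: no death in the window
}


def count_in_window(events, subject_id, start, end, codes):
    """Count a subject's events with a matching code in the interval (start, end]."""
    return sum(
        1
        for sid, ts, code in events
        if sid == subject_id and start < ts <= end and code in codes
    )


def extract_cohort(events, predicates, task):
    """Return (subject_id, trigger_time) pairs whose window meets all criteria."""
    cohort = []
    for sid, ts, code in events:
        if code not in predicates[task["trigger"]]:
            continue  # only trigger events open a candidate window
        end = ts + task["window"]
        meets_min = all(
            count_in_window(events, sid, ts, end, predicates[name]) >= n
            for name, n in task["min_counts"].items()
        )
        meets_max = all(
            count_in_window(events, sid, ts, end, predicates[name]) <= n
            for name, n in task["max_counts"].items()
        )
        if meets_min and meets_max:
            cohort.append((sid, ts))
    return cohort


cohort = extract_cohort(EVENTS, PREDICATES, TASK)
```

In this sketch, moving to another dataset would mean rewriting only `PREDICATES`, while `TASK` stays unchanged, which is the sense in which a task definition can be shared conceptually across datasets.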
