Challenges and recommendations for Electronic Health Records data extraction and preparation for dynamic prediction modelling in hospitalized patients -- a practical guide

Elena Albu, Shan Gao, Pieter Stijnen, Frank E. Rademakers, Bas C T van Bussel, Taya Collyer, Tina Hernandez-Boussard, Laure Wynants, Ben Van Calster

TL;DR

The reliability and clinical utility of dynamic EHR-based prediction models hinge on the quality of the data extracted and prepared from hospital EHR systems. This paper offers a practical guide that catalogs more than 40 challenges across four stages (cohort definition, outcome definition, feature engineering, and data cleaning) and provides actionable recommendations to mitigate them. The authors map these challenges to established data-quality frameworks (Weiskopf and Weng; METRIC) and emphasize maintaining timestamp integrity and avoiding temporal leaks in single-site, structured EHR contexts. The result is a pragmatic resource for improving data extraction and preparation, enhancing model robustness, and supporting real-world deployment in clinical settings, while acknowledging limitations and the need for context-specific adaptation.

Abstract

Dynamic predictive modelling using electronic health record (EHR) data has gained significant attention in recent years. The reliability and trustworthiness of such models depend heavily on the quality of the underlying data, which is, in part, determined by the stages preceding model development: data extraction from EHR systems and data preparation. In this article, we identify over forty challenges encountered during these stages and provide actionable recommendations for addressing them. These challenges are organized into four categories: cohort definition, outcome definition, feature engineering, and data cleaning. This comprehensive list serves as a practical guide for data extraction engineers and researchers, promoting best practices and improving the quality and real-world applicability of dynamic prediction models in clinical settings.

Paper Structure (10 sections, 1 figure)

Figures (1)

  • Figure 1: Data flow for the model building pipeline (I.) and model implementation (II.). Two databases are exemplified as data sources, EHR and ICU, although multiple other sources might be used in the hospital's flow and for data extraction. EHR = Electronic Health Record; DB = database; ICU = Intensive Care Unit; REST API = RESTful application programming interface.