A Common Pipeline for Harmonizing Electronic Health Record Data for Translational Research

Jessica Gronsbell, Vidul Ayakulangara Panickan, Doudou Zhou, Chris Lin, Thomas Charlon, Chuan Hong, Xin Xiong, Linshanshan Wang, Jianhui Gao, Shirley Zhou, Yuan Tian, Yaqi Shi, Ziming Gan, Tianxi Cai

TL;DR

The paper addresses semantic and technical barriers in using diverse EHR data for translational research. It proposes PEHRT, a two-module pipeline for data pre-processing and scalable representation learning that harmonizes structured and unstructured EHR data to standardized ontologies while preserving privacy via summary-level data. It combines PMI-SVD EHR embeddings, PLM-based embeddings, and a joint cross-institution embedding (BONMI) to mitigate heterogeneity and enable federated analysis; it also provides open-source software and an online tutorial. The work demonstrates embedding quality gains and downstream predictive analytics in a multi-institutional context and discusses limitations and future directions.

Abstract

Despite the growing availability of Electronic Health Record (EHR) data, researchers often face substantial barriers in effectively using these data for translational research due to their complexity, heterogeneity, and lack of standardized tools and documentation. To address this critical gap, we introduce PEHRT, a common pipeline for harmonizing EHR data for translational research. PEHRT is a comprehensive, ready-to-use resource that includes open-source code, visualization tools, and detailed documentation to streamline the process of preparing EHR data for analysis. The pipeline provides tools to harmonize structured and unstructured EHR data to standardized ontologies to ensure consistency across diverse coding systems. In the presence of unmapped or heterogeneous local codes, PEHRT further leverages representation learning and pre-trained language models to generate robust embeddings that capture semantic relationships across sites to mitigate heterogeneity and enable integrative downstream analyses. PEHRT also supports cross-institutional co-training through shared representations, allowing participating sites to collaboratively refine embeddings and enhance generalizability without sharing individual-level data. The framework is data model-agnostic and can be seamlessly deployed across diverse healthcare systems to produce interoperable, research-ready datasets. By lowering the technical barriers to EHR-based research, PEHRT empowers investigators to transform raw clinical data into reproducible, analysis-ready resources for discovery and innovation.
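
The TL;DR names PMI-SVD as the institution-specific embedding method and notes that the pipeline works from summary-level data. As a minimal sketch of that general technique (not PEHRT's actual code), the snippet below factorizes a code-by-code co-occurrence count matrix. The function name `pmi_svd_embeddings`, the clipping of negative PMI values to zero, the square-root scaling of singular values, and the toy counts are all assumptions made for illustration.

```python
import numpy as np


def pmi_svd_embeddings(cooc, dim):
    """Return (n_codes x dim) embeddings from a symmetric co-occurrence count matrix.

    Hypothetical helper: positive-PMI clipping and sqrt scaling of the singular
    values are common choices assumed here, not details taken from the paper.
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    # Zero counts give -inf or NaN; treat those pairs as carrying no signal.
    pmi = np.where(np.isfinite(pmi), pmi, 0.0)
    ppmi = np.maximum(pmi, 0.0)
    u, s, _ = np.linalg.svd(ppmi)
    return u[:, :dim] * np.sqrt(s[:dim])


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy counts (made up): codes 0 and 1 co-occur often, as do codes 2 and 3.
toy_cooc = np.array([
    [10, 8, 1, 0],
    [8, 10, 0, 1],
    [1, 0, 6, 5],
    [0, 1, 5, 6],
])
toy_emb = pmi_svd_embeddings(toy_cooc, dim=2)
```

Only the aggregated count matrix enters the computation, which is why this style of embedding can be trained without individual-level records. In the toy example, codes that co-occur frequently end up with higher cosine similarity than codes that rarely co-occur.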

Paper Structure

This paper contains 17 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: PEHRT enables users to prepare a harmonized, research-ready dataset with two modules for data pre-processing and representation learning. Each step of PEHRT is detailed in our user-friendly tutorial and supported by open-source software and web APIs for data visualization.
  • Figure 2: AUC of lasso-penalized logistic regression models for predicting disability status in MS patients based on Patient Determined Disease Steps (PDDS) scores two years after their first visit using varying numbers of selected features (20, 50, 100, and 200). Comparisons are shown for different embedding methods, including BONMI+, BONMI, PLM-based embeddings, institution-specific EHR embeddings (PMI-SVD), and a "random" method consisting of randomly selected features combined with the main PheCode and healthcare utilization feature. Results are presented separately for UPMC (left) and MGB (right), with higher AUC values indicating better predictive performance. The training and test sample sizes are both 500.
  • Figure 3: C-index of lasso-penalized Cox proportional hazards models for predicting time to nursing home admission or death in Alzheimer's disease (AD) patients using varying numbers of selected features (20, 50, 100, and 200). Comparisons are shown for different embedding methods, including BONMI+, BONMI, PLM-based embeddings, institution-specific EHR embeddings (PMI-SVD), and a "random" method consisting of randomly selected features combined with the main PheCode and healthcare utilization feature. Results are presented separately for UPMC (left) and MGB (right), with higher C-index values indicating better predictive performance. The training sample size is 15,000.
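
The C-index reported in Figure 3 measures how often a model ranks patients correctly by risk. As a rough illustration of the standard Harrell-style definition (an assumption here; the page does not say which variant or software the authors used), the sketch below counts concordant pairs among comparable pairs. The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index for right-censored data.

    A pair (i, j) is comparable when subject i had an observed event and a
    strictly shorter follow-up time than subject j. The pair is concordant
    when subject i also has the higher predicted risk; ties in risk count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): the third subject is censored.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
perfect = concordance_index(toy_times, toy_events, [4, 3, 2, 1])
reversed_ranking = concordance_index(toy_times, toy_events, [1, 2, 3, 4])
uninformative = concordance_index(toy_times, toy_events, [1, 1, 1, 1])
```

A value of 1 means every comparable pair is ranked correctly, 0.5 matches an uninformative score, and 0 means the ranking is fully reversed. The double loop is quadratic in the number of patients, so a cohort of the size in Figure 3 would call for an optimized library implementation in practice.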