A Common Pipeline for Harmonizing Electronic Health Record Data for Translational Research
Jessica Gronsbell, Vidul Ayakulangara Panickan, Doudou Zhou, Chris Lin, Thomas Charlon, Chuan Hong, Xin Xiong, Linshanshan Wang, Jianhui Gao, Shirley Zhou, Yuan Tian, Yaqi Shi, Ziming Gan, Tianxi Cai
TL;DR
The paper addresses the semantic and technical barriers to using heterogeneous EHR data for translational research. It proposes PEHRT, a two-module pipeline for data pre-processing and scalable representation learning that harmonizes structured and unstructured EHR data to standardized ontologies while preserving privacy by sharing only summary-level data. It combines PMI-SVD EHR embeddings, embeddings from pre-trained language models (PLMs), and a joint cross-institution embedding (BONMI) to mitigate heterogeneity and enable federated analysis, and it provides open-source software and an online tutorial. The work reports gains in embedding quality, demonstrates downstream predictive analytics in a multi-institutional setting, and discusses limitations and future directions.
Abstract
Despite the growing availability of Electronic Health Record (EHR) data, researchers often face substantial barriers to using these data effectively for translational research because of their complexity, heterogeneity, and lack of standardized tools and documentation. To address this gap, we introduce PEHRT, a common pipeline for harmonizing EHR data for translational research. PEHRT is a comprehensive, ready-to-use resource that includes open-source code, visualization tools, and detailed documentation to streamline the preparation of EHR data for analysis. The pipeline provides tools to harmonize structured and unstructured EHR data to standardized ontologies, ensuring consistency across diverse coding systems. Where local codes are unmapped or heterogeneous, PEHRT further leverages representation learning and pre-trained language models to generate robust embeddings that capture semantic relationships across sites, mitigating heterogeneity and enabling integrative downstream analyses. PEHRT also supports cross-institutional co-training through shared representations, allowing participating sites to collaboratively refine embeddings and enhance generalizability without sharing individual-level data. The framework is data-model-agnostic and can be deployed across diverse healthcare systems to produce interoperable, research-ready datasets. By lowering the technical barriers to EHR-based research, PEHRT empowers investigators to transform raw clinical data into reproducible, analysis-ready resources for discovery and innovation.
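
To make the idea of learning code embeddings from summary-level data concrete, the sketch below factorizes a small co-occurrence matrix with pointwise mutual information (PMI) followed by a truncated SVD, the general technique the summary refers to as PMI-SVD. It is an illustration under stated assumptions, not the PEHRT implementation: the function names, the toy counts, the code labels, and the choice to clip negative PMI values to zero are all hypothetical and may differ from what the pipeline actually does.

```python
import numpy as np


def pmi_svd_embeddings(cooc, dim):
    """Embed codes from a symmetric co-occurrence count matrix.

    Assumption: negative PMI values are clipped to zero (positive PMI);
    the actual pipeline may use a different variant.
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    expected = row @ col / total  # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts carry no signal
    ppmi = np.maximum(pmi, 0.0)
    u, s, _ = np.linalg.svd(ppmi)
    # Scale the leading singular vectors by the root singular values.
    return u[:, :dim] * np.sqrt(s[:dim])


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical summary-level counts: how often two codes appear together.
# Only these aggregate counts are used; no patient-level rows are needed.
codes = ["dx:diabetes", "rx:metformin", "dx:hypertension", "rx:lisinopril"]
cooc = [
    [50, 40, 5, 4],
    [40, 50, 6, 5],
    [5, 6, 50, 35],
    [4, 5, 35, 50],
]

embeddings = pmi_svd_embeddings(cooc, dim=2)
related = cosine(embeddings[0], embeddings[1])    # diabetes vs. metformin
unrelated = cosine(embeddings[0], embeddings[2])  # diabetes vs. hypertension
print(f"related={related:.3f} unrelated={unrelated:.3f}")
```

In this toy example the diagnosis and its usual treatment co-occur far more often than chance, so their embeddings point in nearly the same direction, while codes from the other cluster do not. Because the input is a matrix of aggregate counts, a site could in principle compute and share it without exposing individual-level records.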
