Federated Learning for Heterogeneous Electronic Health Record Systems with Cost Effective Participant Selection
Jiyoun Kim, Junu Kim, Kyunghoon Hur, Edward Choi
TL;DR
The paper tackles two practical hurdles in building host-specific healthcare prediction models with federated learning: heterogeneity across EHR systems and budget constraints. It introduces EHRFL, which couples text-based EHR linearization with Transformer-based encoding to enable cross-institution collaboration without costly common data model (CDM) standardization, and adds a privacy-preserving participant selection method based on averaged patient embeddings to reduce participation costs. Empirical results on the MIMIC and eICU ICU datasets show that text-based FL improves host performance and that averaged-embedding similarity reliably identifies which candidate participants contribute positively, enabling substantial cost savings without sacrificing accuracy. The framework offers a scalable, privacy-conscious approach to deploying cost-effective, institution-tailored FL in real-world healthcare settings, with open-source code to support adoption and further research.
Abstract
The increasing volume of electronic health records (EHRs) presents the opportunity to improve the accuracy and robustness of models in clinical prediction tasks. Unlike traditional centralized approaches, federated learning enables training on data from multiple institutions while preserving patient privacy and complying with regulatory constraints. In practice, healthcare institutions (i.e., hosts) often need to build predictive models tailored to their specific needs (e.g., creatinine-level prediction, N-day readmission prediction) using federated learning. When building a federated learning model for a single healthcare institution, two key challenges arise: (1) ensuring compatibility across heterogeneous EHR systems, and (2) managing federated learning costs within budget constraints. Specifically, heterogeneity in EHR systems across institutions hinders compatible modeling, while the computational costs of federated learning can exceed practical budget limits for healthcare institutions. To address these challenges, we propose EHRFL, a federated learning framework designed for building a cost-effective, host-specific predictive model using patient EHR data. EHRFL consists of two components: (1) text-based EHR modeling, which facilitates cross-institution compatibility without costly data standardization, and (2) a participant selection strategy based on averaged patient embedding similarity, which reduces the number of participants without degrading performance. Because this strategy shares only averaged patient embeddings under a differentially private mechanism, patient privacy is preserved. Experiments on multiple open-source EHR datasets demonstrate the effectiveness of both components. With our framework, healthcare institutions can build institution-specific predictive models under budgetary constraints with reduced costs and time.
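To make the second component concrete, the following is a minimal sketch of participant selection by averaged patient embedding similarity. It is an illustration under stated assumptions, not the authors' implementation: cosine similarity as the metric, Gaussian noise as the privacy mechanism, top-k selection, and all function and parameter names (`select_participants`, `sigma`, `k`) are hypothetical choices not specified in the abstract.

```python
import math
import random


def average_embedding(embeddings):
    """Element-wise mean of a list of equal-length patient embedding vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / n for i in range(dim)]


def add_gaussian_noise(vector, sigma, rng):
    """Perturb each coordinate with zero-mean Gaussian noise.

    Assumption: a Gaussian mechanism stands in for the differentially
    private sharing step; calibrating sigma to a privacy budget is omitted.
    """
    return [x + rng.gauss(0.0, sigma) for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_participants(host_embeddings, candidate_embeddings, k, sigma=0.0, seed=0):
    """Rank candidate institutions by similarity to the host and keep the top k.

    host_embeddings: list of patient embedding vectors held by the host.
    candidate_embeddings: dict mapping institution name -> list of vectors.
    Only each candidate's noised average leaves the candidate in this sketch.
    """
    rng = random.Random(seed)
    host_avg = average_embedding(host_embeddings)
    scores = {}
    for name, embeddings in candidate_embeddings.items():
        shared = add_gaussian_noise(average_embedding(embeddings), sigma, rng)
        scores[name] = cosine_similarity(host_avg, shared)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores
```

Calling `select_participants(host, candidates, k=2)` returns the names of the two candidates whose averaged embeddings are most similar to the host's, together with every candidate's score. In practice, the embeddings would come from a shared encoder applied to each institution's patients, and the cutoff could be a similarity threshold rather than a fixed `k`.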
