Federated Learning for Heterogeneous Electronic Health Record Systems with Cost Effective Participant Selection

Jiyoun Kim; Junu Kim; Kyunghoon Hur; Edward Choi

TL;DR

The paper addresses two practical obstacles to building host-specific healthcare prediction models with federated learning (FL): heterogeneity across EHR systems and limited budgets. It introduces EHRFL, which combines (1) text-based EHR linearization with Transformer-based encoding, enabling cross-institution collaboration without costly common data model (CDM) standardization, and (2) a privacy-preserving participant selection method based on averaged patient embeddings, which reduces participation costs. Experiments on the MIMIC and eICU intensive-care datasets show that text-based FL improves host performance and that averaged-embedding similarity reliably predicts which subjects contribute positively, enabling substantial cost savings without sacrificing accuracy. The framework offers a scalable, privacy-conscious way to deploy cost-effective, institution-tailored FL in real-world healthcare settings, and the code is open source to support adoption and further research.

Abstract

The increasing volume of electronic health records (EHRs) presents an opportunity to improve the accuracy and robustness of models for clinical prediction tasks. Unlike traditional centralized approaches, federated learning enables training on data from multiple institutions while preserving patient privacy and complying with regulatory constraints. In practice, healthcare institutions (i.e., hosts) often need to build predictive models tailored to their specific needs (e.g., creatinine-level prediction, N-day readmission prediction) using federated learning. When building a federated learning model for a single healthcare institution, two key challenges arise: (1) ensuring compatibility across heterogeneous EHR systems, and (2) managing federated learning costs within budget constraints. Specifically, heterogeneity in EHR systems across institutions hinders compatible modeling, while the computational costs of federated learning can exceed practical budget limits for healthcare institutions. To address these challenges, we propose EHRFL, a federated learning framework for building a cost-effective, host-specific predictive model from patient EHR data. EHRFL consists of two components: (1) text-based EHR modeling, which facilitates cross-institution compatibility without costly data standardization, and (2) a participant selection strategy based on averaged patient embedding similarity, which reduces the number of participants without degrading performance. The participant selection strategy, which shares averaged patient embeddings, is differentially private, protecting patient privacy. Experiments on multiple open-source EHR datasets demonstrate the effectiveness of both components. With our framework, healthcare institutions can build institution-specific predictive models under budgetary constraints at reduced cost and time.
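
To illustrate what text-based EHR modeling could look like, the following minimal Python sketch turns event rows from two differently structured EHR systems into plain text that a single text encoder could consume. The function names, the table and column names, and the "name value" text layout are assumptions made for illustration; the abstract does not specify the exact linearization format used by EHRFL.

```python
def linearize_event(table_name, row):
    """Turn one EHR event row into text: table name, then 'column value' pairs.

    Hypothetical format for illustration; missing (None) values are skipped.
    """
    parts = [table_name]
    for column, value in row.items():
        if value is None:
            continue
        parts.append(f"{column} {value}")
    return " ".join(parts)


def linearize_patient(events):
    """Concatenate a patient's (table_name, row) events into one text sequence."""
    return " ".join(linearize_event(table, row) for table, row in events)


# Two institutions record the same kind of lab result under different schemas
# (illustrative table and column names).
site_a_event = ("labevents", {"itemid": "creatinine", "valuenum": 1.2,
                              "valueuom": "mg/dL", "flag": None})
site_b_event = ("lab", {"labname": "creatinine", "labresult": 1.3})

text_a = linearize_event(*site_a_event)
text_b = linearize_event(*site_b_event)
print(text_a)
print(text_b)
```

Because both rows end up as ordinary text, neither institution has to map its schema onto a shared data model before training; the encoder is left to learn that differently named fields carry similar meaning.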

Paper Structure (20 sections, 8 equations, 3 figures, 6 tables)

This paper contains 20 sections, 8 equations, 3 figures, 6 tables.

Introduction
Overall Framework
Text-based EHR Federated Learning
Structure of EHRs
Text-based EHR Linearization
Modeling of EHRs
Participating Subject Selection using Averaged Patient Embeddings
Subject Selection Process
Overall Cost Savings for the Host
Experimental Settings
Datasets
Setup
Cohort & Prediction Tasks
Differential Privacy Parameters
Federated Learning Algorithms
...and 5 more sections

Figures (3)

Figure 1: Federated learning across healthcare institutions (i.e., host, subject) of heterogeneous EHR systems. EHR data is linearized into a standardized text-based format for compatible modeling.
Figure 2: Selection of participating subjects in federated learning based on averaged patient embedding similarity with the host. To ensure privacy, each subject constructs its averaged patient embedding using differential privacy by clipping individual patient embeddings, averaging them, and adding Gaussian noise to the averaged embedding. To ensure consistency in similarity computation, the host applies the same clipping operation to its patient embeddings prior to averaging. Subjects with low similarity scores relative to the host are excluded from the federated learning process.
Figure 3: UML Diagram for EHRFL
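
The selection procedure described in the Figure 2 caption can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity stands in for the similarity score, the noise standard deviation is passed in directly instead of being calibrated to a privacy budget, and all function names, parameters, and the threshold rule are hypothetical.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def averaged_embedding(embeddings, max_norm, noise_std=0.0, rng=None):
    """Clip each patient embedding, average them, then add Gaussian noise.

    noise_std is an assumed, already-chosen value; a real deployment would
    derive it from the clipping norm, cohort size, and privacy parameters.
    """
    rng = rng or random.Random()
    clipped = [clip(e, max_norm) for e in embeddings]
    dim = len(clipped[0])
    mean = [sum(e[i] for e in clipped) / len(clipped) for i in range(dim)]
    if noise_std > 0.0:
        mean = [m + rng.gauss(0.0, noise_std) for m in mean]
    return mean


def cosine(a, b):
    """Cosine similarity (assumed here as the similarity score)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def select_subjects(host_avg, subject_avgs, threshold):
    """Keep subjects whose similarity to the host is at least threshold."""
    scores = {name: cosine(host_avg, avg) for name, avg in subject_avgs.items()}
    selected = [name for name, score in scores.items() if score >= threshold]
    return selected, scores


# The host clips and averages its own embeddings (no noise added in this sketch).
host_avg = averaged_embedding([[3.0, 4.0], [1.0, 0.0]], max_norm=1.0)

# Each subject would send only its noisy averaged embedding; fixed toy values here.
subject_avgs = {"A": [0.8, 0.4], "B": [-0.4, 0.8]}
selected, scores = select_subjects(host_avg, subject_avgs, threshold=0.5)
print(selected, scores)
```

In this toy example, subject A points in the same direction as the host and is kept, while subject B is orthogonal to the host and is excluded. Only one averaged vector per subject crosses institutional boundaries, which is what keeps the selection step cheap compared with running a full federated training round for every candidate.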
