Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data

Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai

TL;DR

The paper tackles the challenge of heterogeneity and privacy in multi-institution EHR studies by introducing GAME, a graph-attention network–based representation framework that harmonizes institution-specific codes with standard vocabularies through a knowledge-graph–augmented, privacy-preserving pipeline. GAME jointly leverages PPMI-SVD embeddings from co-occurrence patterns, SapBERT/PLM-derived semantic representations, and GPT-4–generated mappings to learn cross-institution code embeddings via KG-enhanced GAT alignment, followed by contrastive training to capture similarity and relatedness. The approach is validated on tasks including cross-institution code mapping, feature selection, and federated patient stratification for Alzheimer’s disease and suicide risk, showing superior performance to baselines while preserving patient privacy by sharing only summary statistics. This framework enables scalable, interpretable, cross-institution AI for precision medicine and multi-site research, with potential extensions to dynamic coding updates and the integration of clinical notes.
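
To make the PPMI-SVD step mentioned above concrete, here is a minimal sketch of how code embeddings can be derived from a code-by-code co-occurrence count matrix. It is an illustration under stated assumptions, not the paper's implementation: the function names `ppmi_matrix` and `ppmi_svd_embeddings`, the toy counts, and the embedding dimension are all hypothetical, and the preprocessing GAME actually uses (time windows, smoothing, dimension) is not given in this summary. The input is an aggregate count table, which is consistent with the summary's point that only summary statistics are shared.

```python
import numpy as np


def ppmi_matrix(cooc):
    """Positive pointwise mutual information from a square co-occurrence count matrix."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)   # marginal count per code (rows)
    col = cooc.sum(axis=0, keepdims=True)   # marginal count per code (columns)
    expected = row @ col / total            # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc / expected)       # -inf where a pair never co-occurs
        keep = np.isfinite(pmi) & (pmi > 0)
    return np.where(keep, pmi, 0.0)         # clip negatives and undefined entries to 0


def ppmi_svd_embeddings(cooc, dim):
    """Rank-`dim` embeddings: left singular vectors scaled by sqrt of singular values."""
    ppmi = ppmi_matrix(cooc)
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


# Toy, made-up counts for four hypothetical codes (not data from the paper).
cooc = np.array([
    [10, 8, 1, 0],
    [8, 10, 0, 1],
    [1, 0, 10, 8],
    [0, 1, 8, 10],
])
ppmi = ppmi_matrix(cooc)
emb = ppmi_svd_embeddings(cooc, dim=2)
```

Each row of `emb` is the embedding of one code. Scaling the singular vectors by the square root of the singular values is one common convention; other weightings are possible and the paper may use a different one.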

Abstract

The adoption of electronic health records (EHRs) has expanded opportunities to leverage data-driven algorithms in clinical care and research. A major bottleneck in effectively conducting multi-institutional EHR studies is the data heterogeneity across systems, with numerous codes that either do not exist or represent different clinical concepts across institutions. The need for data privacy further limits the feasibility of including multi-institutional patient-level data required to study similarities and differences across patient subgroups. To address these challenges, we developed the GAME algorithm. Tested and validated across 7 institutions and 2 languages, GAME integrates data at several levels: (1) at the institutional level, with knowledge graphs to establish relationships between codes and existing knowledge sources, providing the medical context for standard codes and their relationship to each other; (2) between institutions, leveraging language models to determine the relationships between institution-specific codes and established standard codes; and (3) quantifying the strength of the relationships between codes using a graph attention network. Jointly trained embeddings are created using transfer and federated learning to preserve data privacy. In this study, we demonstrate the applicability of GAME in selecting relevant features as inputs for AI-driven algorithms in a range of conditions, such as heart failure and rheumatoid arthritis. We then highlight the application of GAME-harmonized multi-institutional EHR data in a study of Alzheimer's disease outcomes and suicide risk among patients with mental health disorders, without sharing patient-level data outside individual institutions.

Paper Structure

This paper contains 50 sections, 11 equations, 18 figures, 17 tables, 2 algorithms.

Figures (18)

  • Figure 1: Overview of GAME approach with (a) data source, extraction, processing, and algorithm, and (b) key steps in the validation of the approach.
  • Figure 2: The data processing procedure of the GAME algorithm.
  • Figure 3: Mapping local codes to standard codes using GPT-4.
  • Figure 4: Overview of key steps in the GAME algorithm: (a) aligning embeddings into a shared representation space, (b) sequentially learning the similarity and relatedness with contrastive learning.
  • Figure 5: Comparison of AUCs for detecting similarity (left) and relatedness (right) relationships using embeddings from different methods or PLMs. PPMI AVE stands for the average AUC by using the institutional PPMI embeddings.
  • ...and 13 more figures