Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare

Natallia Kokash, Lei Wang, Thomas H. Gillespie, Adam Belloum, Paola Grosso, Sara Quinney, Lang Li, Bernard de Bono

TL;DR

Privacy regulations and data heterogeneity impede large-scale ML on electronic health records. The authors propose a two-step data alignment method that combines ontology-based candidate matching with LLM adjudication within a privacy-preserving federated learning framework implemented via Brane and EPI, validated on a real-world MPRINT use case. Their results show mapping precision in the range of 78% to 92% across unannotated and ICD-10 annotated datasets, with high alignment between LLMs and human experts after refinement. This work demonstrates a practical path toward interoperable, secure FL in healthcare and outlines a roadmap for open, low-code data-harmonization workflows that can accelerate clinical research while preserving privacy.

Abstract

The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy integrating ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, demonstrating its effectiveness in a real-world project involving semantic mapping of EHR data.

Paper Structure

This paper contains 13 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: VANTAGE6 [Moncada-Torres2021]. The server loads its configuration parameters and exposes its RESTful API Nodes.
  • Figure 2: Brane's approach to the distributed workflow implementation via separation of user roles [9582292].
  • Figure 3: EPI Framework [kassem2021epi].
  • Figure 4: LLM-based pipeline to align data with target vocabulary. (A) Prepare target mapping space; (B) Find best matching targets for input data; (C) Define acceptance criteria for the LLM and evaluate the best matching pairs.
  • Figure 5: LLM-based pipeline to annotate patient outcomes with MONDO and/or HPO ontology terms. (A) Extract labels and synonyms from target databases, retain ontology identifiers as metadata; (B) Find best matching labels for each outcome and form pairs; (C) Ask LLM to accept pairs with the same or more general target outcomes (diseases).
  • ...and 4 more figures
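
The three steps named in the Figure 4 caption (prepare the target space, find best matching targets, let an LLM accept or reject pairs) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vocabulary and its `EX:` identifiers are placeholders rather than real ontology codes, plain string similarity stands in for the ontology-based candidate matching, and `stub_judge` is a deterministic stand-in for the LLM call. It is not the authors' implementation.

```python
from difflib import SequenceMatcher

# Hypothetical toy vocabulary: identifier -> label and synonyms.
# The "EX:" identifiers are placeholders, not real MONDO/HPO/ICD-10 codes.
TARGETS = {
    "EX:0001": ["acute kidney injury", "acute renal failure"],
    "EX:0002": ["hypertension", "high blood pressure"],
    "EX:0003": ["type 2 diabetes mellitus", "adult-onset diabetes"],
}


def prepare_targets(targets):
    """(A) Flatten labels and synonyms, keeping the identifier as metadata."""
    return [
        (label.lower(), term_id)
        for term_id, labels in targets.items()
        for label in labels
    ]


def best_candidates(source, space, top_k=2):
    """(B) Rank target labels by string similarity to the source term."""
    scored = [
        (SequenceMatcher(None, source.lower(), label).ratio(), label, term_id)
        for label, term_id in space
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]


def stub_judge(source, target):
    """Stand-in for the LLM: accept a pair if the two terms share a word."""
    return bool(set(source.lower().split()) & set(target.lower().split()))


def align(source, space, judge, top_k=2):
    """(C) Return the identifier of the first accepted candidate, else None."""
    for _score, label, term_id in best_candidates(source, space, top_k):
        if judge(source, label):
            return term_id
    return None


space = prepare_targets(TARGETS)
matched = align("High blood presure", space, stub_judge)  # misspelled input
unmatched = align("broken leg", space, stub_judge)
```

With this toy data, `matched` is `"EX:0002"` and `unmatched` is `None`. In a real setting the `judge` argument would wrap a prompt to an LLM, and the candidate ranking would likely use a stronger retrieval method than character-level similarity.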