Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare

Natallia Kokash; Lei Wang; Thomas H. Gillespie; Adam Belloum; Paola Grosso; Sara Quinney; Lang Li; Bernard de Bono

TL;DR

Privacy regulations and data heterogeneity impede large-scale machine learning on electronic health records. The authors propose a two-step data alignment method that combines ontology-based candidate matching with LLM adjudication, embedded in a privacy-preserving federated learning framework implemented via Brane and EPI and validated on a real-world MPRINT use case. Reported mapping precision ranges from 78% to 92% across unannotated and ICD-10-annotated datasets, with high agreement between LLMs and human experts after refinement. The work demonstrates a practical path toward interoperable, secure FL in healthcare and outlines a roadmap for open, low-code data-harmonization workflows that can accelerate clinical research while preserving privacy.

Abstract

The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet it faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy that integrates ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, and demonstrates its effectiveness in a real-world project involving semantic mapping of EHR data.
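To make the two-step idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy in-memory ontology with made-up term IDs, uses plain string similarity from Python's standard library for candidate matching, and replaces the LLM with a stub adjudicator so the example runs offline. All names (`ONTOLOGY`, `candidate_terms`, `stub_adjudicator`, `harmonize`) and both thresholds are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

# Hypothetical toy ontology: term ID -> label and synonyms (not from the paper).
ONTOLOGY = {
    "TERM:0001": ["body weight", "weight"],
    "TERM:0002": ["body height", "height"],
    "TERM:0003": ["gestational age"],
}

Candidates = list[tuple[str, float]]


def similarity(a: str, b: str) -> float:
    """Case-insensitive lexical similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_terms(field: str, top_k: int = 2, min_score: float = 0.3) -> Candidates:
    """Step 1: shortlist ontology terms whose best label resembles the field name."""
    scored = []
    for term_id, labels in ONTOLOGY.items():
        best = max(similarity(field, label) for label in labels)
        if best >= min_score:
            scored.append((term_id, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def stub_adjudicator(field: str, candidates: Candidates) -> Optional[str]:
    """Step 2 stand-in: a real system would ask an LLM to choose among candidates.

    This stub accepts the top candidate only if its score clears an
    assumed threshold, and otherwise leaves the field unmapped.
    """
    if candidates and candidates[0][1] >= 0.6:
        return candidates[0][0]
    return None


def harmonize(
    fields: list[str],
    adjudicate: Callable[[str, Candidates], Optional[str]] = stub_adjudicator,
) -> dict[str, Optional[str]]:
    """Map each local field name to an ontology term ID, or None if unresolved."""
    return {field: adjudicate(field, candidate_terms(field)) for field in fields}


mapping = harmonize(["Weight", "pt_height", "zip code"])
print(mapping)
```

Only field names and ontology labels pass through this sketch, never patient records, which is in keeping with the privacy-preserving setting described above. Swapping `stub_adjudicator` for a function that prompts an LLM with the field name and the shortlisted terms would leave the rest of the pipeline unchanged.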
