Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare
Natallia Kokash, Lei Wang, Thomas H. Gillespie, Adam Belloum, Paola Grosso, Sara Quinney, Lang Li, Bernard de Bono
TL;DR
Privacy regulations and data heterogeneity impede large-scale ML on electronic health records. The authors propose a two-step data alignment method that combines ontology-based candidate matching with LLM adjudication. The method runs within a privacy-preserving federated learning framework implemented via Brane and EPI and is validated on a real-world MPRINT use case. Reported mapping precision ranges from 78% to 92% across unannotated and ICD-10-annotated datasets, with high agreement between LLMs and human experts after refinement. The work demonstrates a practical path toward interoperable, secure FL in healthcare and outlines a roadmap for open, low-code data-harmonization workflows that can accelerate clinical research while preserving privacy.
Abstract
The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy that integrates ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, and demonstrates its effectiveness in a real-world project involving semantic mapping of EHR data.
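As a rough illustration of the two-step idea (ontology-based candidate matching followed by adjudication), the sketch below maps local column names to concept identifiers. It is not the authors' implementation. The toy ontology, the concept IDs, the string-similarity scoring via `difflib`, and the `stub_adjudicator` function that stands in for an LLM call are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

# Hypothetical toy ontology: concept ID -> labels and synonyms.
# The IDs are made up for illustration and belong to no real ontology.
ONTOLOGY: Dict[str, List[str]] = {
    "C001": ["body weight", "weight"],
    "C002": ["body height", "height"],
    "C003": ["systolic blood pressure", "sbp"],
}

Adjudicator = Callable[[str, List[str], Dict[str, List[str]]], Optional[str]]


def normalize(name: str) -> str:
    """Lower-case a column name and turn underscores into spaces."""
    return name.lower().replace("_", " ").strip()


def candidate_concepts(
    column: str,
    ontology: Dict[str, List[str]],
    top_k: int = 2,
    min_score: float = 0.4,
) -> List[str]:
    """Step 1: rank concepts by their best label similarity to the column."""
    query = normalize(column)
    scored = []
    for concept_id, labels in ontology.items():
        score = max(
            SequenceMatcher(None, query, normalize(label)).ratio()
            for label in labels
        )
        if score >= min_score:
            scored.append((concept_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [concept_id for concept_id, _ in scored[:top_k]]


def stub_adjudicator(
    column: str, candidates: List[str], ontology: Dict[str, List[str]]
) -> Optional[str]:
    """Step 2 placeholder: a real system would prompt an LLM here.

    This stub simply accepts the top-ranked candidate, or returns None
    when step 1 produced no candidates.
    """
    return candidates[0] if candidates else None


def harmonize(
    columns: List[str],
    ontology: Dict[str, List[str]],
    adjudicate: Adjudicator,
) -> Dict[str, Optional[str]]:
    """Run both steps for every column and collect the chosen concept."""
    return {
        column: adjudicate(column, candidate_concepts(column, ontology), ontology)
        for column in columns
    }


if __name__ == "__main__":
    print(harmonize(["Body_Weight", "SBP", "zzzz"], ONTOLOGY, stub_adjudicator))
```

Keeping the adjudicator as a plain callable means the candidate-generation step can be tested on its own, and an LLM-backed function could later replace the stub without changing `harmonize`. Only column names and concept labels pass through the pipeline in this sketch, which matches the setting where raw records stay at each site.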
