Federated Random Forest for Partially Overlapping Clinical Data

Youngjun Park, Cord Eric Schmidt, Benedikt Marcel Batton, Anne-Christin Hauschild

TL;DR

Clinical data in healthcare are fragmented by privacy laws and non-harmonized features, which hinders large-scale analysis. The authors adapt Federated Random Forest to partially overlapping data (FRF-POD): each site trains locally and exchanges trees via a central aggregator to form a globally optimized model that respects local feature availability, and two aggregation strategies, additive and constant, are used to update the local forests. Across three clinical datasets (ILPD, HCC, BCD), FRF-POD consistently improves prediction (AUC/PRAUC/MCC) over purely local RFs, though the gains shrink as the number of sites grows or the feature overlap decreases, and additive aggregation often yields stronger gains than constant aggregation. This approach enables privacy-preserving, multi-site collaboration on heterogeneous clinical data, reducing the need for data centralization while maintaining performance.

Abstract

In the healthcare sector, heightened awareness of data privacy and the corresponding data protection regulations, together with heterogeneous and non-harmonized data, pose major challenges to large-scale data analysis. Moreover, clinical data often have only partially overlapping features, because procedures, diagnostic tests, and recorded patient history differ across hospitals and institutes, leaving some observations missing. Addressing partially overlapping features and incomplete data in clinical datasets requires a comprehensive approach. In the medical domain, federated random forests achieve promising results whenever features align. However, most standard algorithms, including random forest, require that all datasets share an identical set of features. Therefore, this work adapts the concept of federated random forest to a setting with partially overlapping features and assesses the effectiveness of the resulting models on partially overlapping clinical data. When the federated, globally optimized model is aggregated, each site can use only the features available locally. We examined two factors in federation: (i) the number of participating parties and (ii) the degree of feature overlap. The evaluation was conducted across three clinical datasets. Even when only a subset of features overlaps, the federated random forest model consistently outperforms its local counterpart. This holds across various scenarios, including datasets with imbalanced classes. Consequently, federated random forests for partially overlapping data offer a promising way to overcome barriers in collaborative research and corporate cooperation.
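
To make the constraint that "only features available locally at each site can be used" concrete, the following is a minimal sketch of one way a site could filter a shared pool of trees. It is not the authors' implementation: it assumes scikit-learn forests, and the helper names (`train_local_forest`, `select_usable_trees`, `predict_proba`), the feature names `f0` to `f5`, and the synthetic data are all hypothetical. The additive and constant aggregation strategies are not modeled here, since this summary does not define them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def train_local_forest(X, y, feature_names, n_trees=10, seed=0):
    """Train a local forest; tag each tree with the named features it splits on."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    tagged = []
    for tree in forest.estimators_:
        split_idx = tree.tree_.feature[tree.tree_.feature >= 0]  # negative entries mark leaves
        used = {feature_names[i] for i in split_idx}
        tagged.append((tree, list(feature_names), used))
    return tagged


def select_usable_trees(global_pool, local_features):
    """Keep only trees whose splits rely on features the local site actually has."""
    local = set(local_features)
    return [entry for entry in global_pool if entry[2] <= local]


def predict_proba(trees, X, local_features):
    """Average the positive-class probability over the selected trees."""
    column_of = {name: j for j, name in enumerate(local_features)}
    total = np.zeros(X.shape[0])
    for tree, train_features, _ in trees:
        # Rebuild the column layout this tree was trained on. Columns the site
        # lacks stay zero; the filter above guarantees they are never consulted.
        X_tree = np.zeros((X.shape[0], len(train_features)))
        for j, name in enumerate(train_features):
            if name in column_of:
                X_tree[:, j] = X[:, column_of[name]]
        total += tree.predict_proba(X_tree)[:, 1]
    return total / len(trees)


# Hypothetical two-site setup: site A lacks f5, site B lacks f0.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)
names = [f"f{i}" for i in range(6)]
feats_a, feats_b = names[:5], names[1:]
pool = (train_local_forest(X[:100, :5], y[:100], feats_a, seed=1)
        + train_local_forest(X[100:, 1:], y[100:], feats_b, seed=2))
usable_a = select_usable_trees(pool, feats_a)
proba_a = predict_proba(usable_a, X[:100, :5], feats_a)
print(len(pool), len(usable_a))
```

Site A always keeps its own trees and additionally gains any tree from site B that happens not to split on `f5`. Under this assumed filtering rule, a smaller feature overlap leaves fewer foreign trees usable at each site.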

Paper Structure (11 sections, 1 equation, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overview of the proposed federated random forest model. Data are split across multiple sites, and random features are dropped to simulate partially overlapping conditions. The federated random forest model for partially overlapping data aggregates all trees from the local random forest models. Each site can then request and extract a globally aggregated random forest model, and each optimized local model is evaluated on its own local test data.
  • Figure 2: Interquartile range plots for the HCC and ILPD data, comparing the random forest model improved with federated learning (blue; go-local RF) against the local random forest model (red; local RF). The first row shows AUC scores for a varying number of sites when 20% of the features are dropped. The second row shows AUC scores for a varying number of dropped features when the data are split into two sites. Each plot shows the mean AUC with the corresponding interquartile range shaded around it.
  • Figure 3: Difference in AUC score between the federation methods. The addition and constant methods are compared by mean AUC across all scenarios with different numbers of sites and dropped features. The Gardner-Altman comparison plots show the distribution of the split/feature-drop scenarios and the corresponding mean differences among them.
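
The simulation step described in the Figure 1 caption (splitting data into sites and dropping random features) can be sketched as follows. This is an illustrative assumption, not the paper's code: the function name `simulate_sites`, the choice to drop features independently at each site, and the rounding of the drop fraction are all hypothetical.

```python
import numpy as np


def simulate_sites(X, n_sites, drop_fraction, seed=0):
    """Split rows into disjoint sites and drop a random feature subset at each site."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Shuffle the row indices, then cut them into n_sites disjoint parts.
    row_parts = np.array_split(rng.permutation(n_samples), n_sites)
    n_drop = int(round(drop_fraction * n_features))
    sites = []
    for rows in row_parts:
        rows = np.sort(rows)
        # Assumption: each site draws its own dropped features independently.
        dropped = rng.choice(n_features, size=n_drop, replace=False)
        kept = np.setdiff1d(np.arange(n_features), dropped)
        sites.append({"rows": rows, "features": kept, "X": X[np.ix_(rows, kept)]})
    return sites


# Toy data: 10 samples, 10 features, two sites, 20% of features dropped per site.
X_demo = np.arange(100, dtype=float).reshape(10, 10)
sites = simulate_sites(X_demo, n_sites=2, drop_fraction=0.2)
for site in sites:
    print(site["X"].shape, site["features"])
```

Because the dropped features are drawn per site, two sites may or may not drop the same columns, which is what produces a partially overlapping feature set in this sketch.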