Federated unsupervised random forest for privacy-preserving patient stratification

Bastian Pfeifer, Christel Sirocchi, Marcus D. Bloice, Markus Kreuzthaler, Martin Urschler

TL;DR

The paper addresses privacy in disease subtyping from multi-omics data by proposing a federated, unsupervised random-forest clustering framework. Trees are grown with a novel unsupervised splitting rule based on the Fixation Index, $\Delta_{\mathcal{F}}(t, x(t), z(t)) = \frac{(D_{\square}(t^{left}) + D_{\square}(t^{right}))/2}{D_{\nabla}(t)}$, and an affinity matrix records the sample pairs that fall into the same leaf. A federated ensemble aggregates locally trained trees into a global model, from which a normalized affinity matrix $\hat{A}_{global}$ is derived and clustered with Ward linkage. On TCGA cancer data and machine learning benchmark data sets, the approach is competitive with the state-of-the-art in disease subtyping while improving interpretability through cluster-specific feature importance. This supports privacy-preserving, interpretable precision medicine workflows for decentralized omics data.
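The same-leaf affinity and the pooling of local trees described above can be illustrated with a minimal sketch. It is not the authors' implementation: the one-split `stump` trees, their thresholds, the sample values, and the choice to normalize by the largest entry are all assumptions made for illustration, and the Fixation Index split rule itself is not implemented here.

```python
def stump(feature, threshold):
    """Hypothetical one-split 'tree': maps a sample to leaf 0 or leaf 1."""
    return lambda x: int(x[feature] > threshold)


def affinity(trees, samples):
    """Count, for every sample pair, how many trees put both in the same leaf."""
    n = len(samples)
    counts = [[0] * n for _ in range(n)]
    for tree in trees:
        leaves = [tree(x) for x in samples]
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return counts


def normalize(counts):
    """Scale to [0, 1] by the largest entry (assumed normalization)."""
    top = max(max(row) for row in counts)
    return [[value / top for value in row] for row in counts]


# Trees grown separately at two sites; only the trees are pooled, not the data.
site_a_trees = [stump(0, 0.5), stump(1, 0.5)]   # made-up thresholds
site_b_trees = [stump(0, 0.4), stump(1, 0.6)]   # made-up thresholds
global_forest = site_a_trees + site_b_trees

# One site passes its own samples through the pooled forest.
local_samples = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
A_hat = normalize(affinity(global_forest, local_samples))
```

In this toy setup, samples 0 and 1 share a leaf in every tree, as do samples 2 and 3, so the normalized affinity is 1.0 within each pair and 0.0 across pairs. One option for the clustering step would be to convert `A_hat` into a dissimilarity such as `1 - A_hat` and pass it to a Ward linkage routine; whether that matches the paper's exact procedure is not stated in this summary.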

Abstract

In the realm of precision medicine, effective patient stratification and disease subtyping demand innovative methodologies tailored for multi-omics data. Clustering techniques applied to multi-omics data have become instrumental in identifying distinct subgroups of patients, enabling a finer-grained understanding of disease variability. This work establishes a powerful framework for advancing precision medicine through unsupervised random-forest-based clustering and federated computing. We introduce a novel multi-omics clustering approach utilizing unsupervised random-forests. The unsupervised nature of the random forest enables the determination of cluster-specific feature importance, unraveling key molecular contributors to distinct patient groups. Moreover, our methodology is designed for federated execution, a crucial aspect in the medical domain where privacy concerns are paramount. We have validated our approach on machine learning benchmark data sets as well as on cancer data from The Cancer Genome Atlas (TCGA). Our method is competitive with the state-of-the-art in terms of disease subtyping, but at the same time substantially improves the cluster interpretability. Experiments indicate that local clustering performance can be improved through federated computing.

Paper Structure (13 sections, 12 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Clustering performance, measured by the Adjusted Rand Index (ARI), of the Ward clustering algorithm on affinity matrices derived from the proposed unsupervised random forest with the novel split rule (uRF), Euclidean distance (HC), and Euclidean distance on standardised data (HCscaled), evaluated in four scenarios: (a) globular clusters of equal size, (b) globular clusters with outliers, (c) globular clusters of varying sizes, and (d) non-globular clusters shaped as concentric circles.
  • Figure 2: (a) Varying the minimum size of leaf nodes. (b) Varying the number of sampled features. The resulting dendrograms are cut such that the number of clusters aligns with the ground truth.
  • Figure 3: Verifying the optimal number of clusters using the proposed unsupervised random forest by successively reducing the number of trees. The clustering solutions at various levels of $k$ were created using an unsupervised random forest comprising 500 trees. The derived affinity matrix served as input for hierarchical clustering. The dendrogram was cut at different levels of $k$, and the resulting multi-class labels were passed back to the unsupervised random forest as a response vector. In this way, the samples within the leaf nodes of the unsupervised random forest are labelled to allow for predictions.
  • Figure 4: Non-federated disease subtype discovery based on multi-omics data in comparison with alternative approaches. Coloured bars represent the method-specific $p$-values of the Cox log-rank test from 30 iterations. The vertical line refers to the $\alpha = 0.05$ significance level. For uRF, the Silhouette coefficient was used to determine the optimal number of clusters.
  • Figure 5: Kidney (KIRC) cancer data set. Survival curves of the four detected clusters are displayed. The other panels show the cluster-specific feature importance values and their inter-cluster correlation.
  • ...and 2 more figures
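
Figure 1 reports clustering quality as the Adjusted Rand Index (ARI). For readers unfamiliar with the metric, the following is a minimal standard-library sketch of the usual pair-counting formula. The function name and the toy label vectors are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1.0 for identical partitions, about 0 for random ones."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare

    # Contingency counts and the two marginal cluster sizes.
    joint = Counter(zip(labels_true, labels_pred))
    sizes_true = Counter(labels_true)
    sizes_pred = Counter(labels_pred)

    # Number of sample pairs placed together in both, or in each, partition.
    together_both = sum(comb(c, 2) for c in joint.values())
    together_true = sum(comb(c, 2) for c in sizes_true.values())
    together_pred = sum(comb(c, 2) for c in sizes_pred.values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (together_both - expected) / (maximum - expected)


# Same partition under different label names, and a fully crossed partition.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which samples are grouped together, relabelling the clusters leaves the score unchanged, which is why it suits comparisons between a dendrogram cut and ground-truth classes.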