Federated One-Shot Ensemble Clustering

Rui Duan, Xin Xiong, Jueyi Liu, Katherine P. Liao, Tianxi Cai

TL;DR

Multi-site clustering under data-sharing and privacy constraints is challenging due to restricted access to raw data. The authors propose FONT, a one-shot federated ensemble clustering framework that builds a data-adaptive ensemble of locally fitted clustering results by exchanging only model parameters and predicted labels. They provide a spectral weighting scheme with theoretical guarantees that the ensemble distance converges to the true interclass distance and is at least as good as the best local estimator, even with a fraction of poor models, and they demonstrate robustness and improved transferability through simulations. An application to rheumatoid arthritis medication sequence data across two health systems shows that FONT yields four cross-site latent clusters with higher cross-system consistency than single-site analyses. Overall, FONT offers a privacy-preserving, scalable approach for practical multi-site clustering tasks that can accommodate a variety of clustering methods.

Abstract

Cluster analysis across multiple institutions poses significant challenges due to data-sharing restrictions. To overcome these limitations, we introduce the Federated One-shot Ensemble Clustering (FONT) algorithm, a novel solution tailored for multi-site analyses under such constraints. FONT requires only a single round of communication between sites and ensures privacy by exchanging only fitted model parameters and class labels. The algorithm combines locally fitted clustering models into a data-adaptive ensemble, making it broadly applicable to various clustering techniques and robust to differences in cluster proportions across sites. Our theoretical analysis validates the effectiveness of the data-adaptive weights learned by FONT, and simulation studies demonstrate its superior performance compared to existing benchmark methods. We applied FONT to identify subgroups of patients with rheumatoid arthritis across two health systems, revealing improved consistency of patient clusters across sites, while locally fitted clusters proved less transferable. FONT is particularly well-suited for real-world applications with stringent communication and privacy constraints, offering a scalable and practical solution for multi-site clustering.
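The one-shot ensemble idea described above can be illustrated with a small, self-contained sketch. It is not the authors' algorithm. It assumes each site shares a set of centroids as its "fitted model parameters", and it swaps in a simple stand-in for the spectral weighting: the leading eigenvector of a model-by-model agreement matrix, found by power iteration. All function names (`assign`, `co_membership`, `spectral_weights`, `cut`) and the toy data are hypothetical.

```python
from math import dist

def assign(points, centroids):
    """Apply a shared local model: label each point by its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]

def co_membership(labels):
    """N x N matrix with 1.0 where two points receive the same label."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]

def agreement(m1, m2):
    """Fraction of point pairs on which two co-membership matrices agree."""
    n = len(m1)
    same = sum(m1[i][j] == m2[i][j] for i in range(n) for j in range(n))
    return same / (n * n)

def spectral_weights(mats, iters=200):
    """Assumed weighting: leading eigenvector of the agreement matrix,
    computed by power iteration and normalized to sum to 1."""
    m = len(mats)
    s = [[agreement(a, b) for b in mats] for a in mats]
    w = [1.0 / m] * m
    for _ in range(iters):
        w = [sum(s[i][j] * w[j] for j in range(m)) for i in range(m)]
        total = sum(w)
        w = [x / total for x in w]
    return w

def ensemble_similarity(mats, weights):
    """Weighted average of the co-membership matrices."""
    n = len(mats[0])
    return [[sum(w * mat[i][j] for w, mat in zip(weights, mats))
             for j in range(n)] for i in range(n)]

def cut(sim, threshold=0.5):
    """Final labels: connected components of the graph sim > threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy data held at one site: two well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

# Hypothetical centroids received from three sites; the third is a poor model.
models = [
    [(0, 0), (10, 10)],
    [(1, 1), (9, 9)],
    [(-5, 5), (5.2, -5.2)],
]

mats = [co_membership(assign(points, c)) for c in models]
weights = spectral_weights(mats)
final_labels = cut(ensemble_similarity(mats, weights))
```

In this toy setting the two models that agree with each other receive larger weights than the poor one, so the weighted ensemble recovers the two groups. Raw data never leaves the site in the sketch: only centroids come in, and only labels would go out.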

Paper Structure

This paper contains 10 sections, 2 theorems, 13 equations, 5 figures, 1 table, and 1 algorithm.

Key Result

Theorem 1

Suppose that $\{\bm{z}_m\}_{1\le m\le M}$ are identically distributed sub-Gaussian vectors with parameter $\sigma^2$, and for each $m$, we have $\|\bm{z}_m\|_2^2=c\sigma^2 \tau_N^2(1+o_P(1))$ for some constant $c>0$ and sequence $\tau_N$. Suppose that $\|\bold{f}_0\|_2/\sigma\gg \log N+\tau_N\sqrt{\

Figures (5)

  • Figure 1: Model performance across different simulation settings, evaluated by the average Rand index.
  • Figure 2: Correlation between the data-adaptive weights assigned to the local models and the performance of those models, evaluated by the adjusted Rand index.
  • Figure 3: Medication frequency at each time point of the four clusters identified by FONT.
  • Figure 4: Site difference in (a) transition probability matrices and (b) initial probabilities fitted by Markov models on MGB and VA, stratified by the ensemble (left) or local (right) fitted clustering membership.
  • Figure 5: Evaluating the clusters using ESR.
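
Figures 1 and 2 report the Rand index and the adjusted Rand index. For readers unfamiliar with these metrics, the following is a minimal sketch of their standard definitions in plain Python. The function names are illustrative, and the code is not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def adjusted_rand_index(a, b):
    """Rand index corrected for chance agreement (standard formula)."""
    n = len(a)
    together = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Degenerate case (e.g. both labelings put everything in one cluster);
        # returning 1.0 here is a convention assumed for this sketch.
        return 1.0
    return (together - expected) / (maximum - expected)

# Both metrics ignore how clusters are named: swapping labels changes nothing.
same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_ari = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The adjusted version equals 1 for identical partitions and is close to 0 for chance-level agreement; it can be negative when agreement is worse than chance.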

Theorems & Definitions (4)

  • Remark 1
  • Theorem 1
  • Theorem 2
  • Remark 2