Federated Variational Inference for Bayesian Mixture Models

Jackie Rao, Francesca L. Crowe, Tom Marshall, Sylvia Richardson, Paul D. W. Kirk

TL;DR

This work addresses scalable, privacy-preserving clustering of large binary and categorical datasets in a federated setting. It introduces FedMerDel, a one-shot federated variational algorithm that performs local merge/delete moves (MerDel) within data batches and then a principled global merge across batches using the ELBO objective, without sharing raw data. Empirical results on simulations, MNIST, and THIN EHR data show FedMerDel achieves clustering accuracy close to centralized methods while offering substantial speedups and robustness to batch heterogeneity, with a viable variable selection extension for noisy features. The approach has practical impact for population-level disease clustering and multimorbidity analysis in healthcare, enabling scalable, privacy-aware analysis across institutions.

Abstract

We present a federated learning approach for Bayesian model-based clustering of large-scale binary and categorical datasets. We introduce a principled 'divide and conquer' inference procedure using variational inference with local merge and delete moves within batches of the data in parallel, followed by 'global' merge moves across batches to find global clustering structures. We show that these merge moves require only summaries of the data in each batch, enabling federated learning across local nodes without requiring the full dataset to be shared. Empirical results on simulated and benchmark datasets demonstrate that our method performs well in comparison to existing clustering algorithms. We validate the practical utility of the method by applying it to large scale electronic health record (EHR) data.
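To illustrate why a merge move can be evaluated from batch summaries alone, the following is a minimal sketch and not the authors' implementation. It assumes binary data with independent Beta(a, b) priors on each feature, and it scores a merge by the change in a Beta-Bernoulli log marginal likelihood, used here as a stand-in for the ELBO comparison described in the abstract. The function names, the summary layout (cluster size plus per-feature counts of ones), and the prior values are all hypothetical.

```python
from math import lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(n, ones, a=1.0, b=1.0):
    """Log marginal likelihood of one cluster of n binary rows.

    Assumes independent Beta(a, b)-Bernoulli features (an assumption of
    this sketch). Only the cluster size n and the per-feature counts of
    ones are needed, never the raw rows.
    """
    return sum(log_beta(a + s, b + n - s) - log_beta(a, b) for s in ones)


def merge_summaries(s1, s2):
    """Combine two cluster summaries (n, ones) by adding their counts."""
    n1, ones1 = s1
    n2, ones2 = s2
    return n1 + n2, [x + y for x, y in zip(ones1, ones2)]


def merge_gain(s1, s2):
    """Score for merging two clusters; positive values favour merging.

    This is a marginal-likelihood proxy, not the ELBO-based criterion
    used in the paper.
    """
    merged = merge_summaries(s1, s2)
    return log_marginal(*merged) - log_marginal(*s1) - log_marginal(*s2)


# Hypothetical summaries sent by two nodes: (cluster size, ones per feature).
node_a_cluster = (50, [45, 5, 48])
node_b_similar = (60, [55, 4, 57])
node_b_different = (60, [3, 58, 2])

print(merge_gain(node_a_cluster, node_b_similar) > 0)    # similar profiles
print(merge_gain(node_a_cluster, node_b_different) > 0)  # dissimilar profiles
```

In this sketch each node would transmit only one `(n, ones)` pair per local cluster, so the coordinating step never sees individual records. A full mixture model would also account for mixture weights and cluster assignments, which are omitted here.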

Paper Structure

This paper contains 75 sections, 26 equations, 21 figures, 17 tables.

Figures (21)

  • Figure 1: Plot comparing the mean time taken by MerDel with and without parallelisation ('par' and 'full'), and by FedMerDel, as we vary $N$ with 5 or 10 batches/cores in 'Global Merge Simulations', with an approximate 95% confidence interval (mean $\pm 1.96 \times \frac{\text{s.d.}}{\sqrt{n}}$). FedMerDel results use random search.
  • Figure 2: EHR clustering results with random search. Columns are health conditions, and rows correspond to clusters. Shading indicates the proportion of individuals in each cluster with each condition; row labels show the number of individuals per cluster. To improve visualisation, only the most prevalent health conditions are shown. See the EHR appendix for further details, the coding of health conditions, and greedy search results.
  • Figure 3: Scatter plot comparing ARIs achieved by each model across all datasets and initialisations in Simulations 1.1, 1.2 and 1.3 (labelled 1, 2, 3 respectively). Each point represents one ARI from one MerDel run.
  • Figure 4: Plot comparing the final number of clusters across clustering models in Simulations 1.1-1.3.
  • Figure 6: Scatter plot comparing ARIs across all datasets and initialisations in Simulations 1.4 and 1.5.
  • ...and 16 more figures