Federated Variational Inference for Bayesian Mixture Models
Jackie Rao, Francesca L. Crowe, Tom Marshall, Sylvia Richardson, Paul D. W. Kirk
TL;DR
This work addresses scalable, privacy-preserving clustering of large binary and categorical datasets in a federated setting. It introduces FedMerDel, a one-shot federated variational algorithm that first applies local merge and delete moves (MerDel) within each data batch, then performs a principled global merge across batches, guided by the evidence lower bound (ELBO), without sharing raw data. Empirical results on simulations, MNIST, and THIN EHR data show that FedMerDel achieves clustering accuracy close to that of centralized methods while offering substantial speedups and robustness to batch heterogeneity, and that a variable selection extension is viable for noisy features. The approach is relevant to population-level disease clustering and multimorbidity analysis in healthcare, where it enables scalable, privacy-aware analysis across institutions.
Abstract
We present a federated learning approach for Bayesian model-based clustering of large-scale binary and categorical datasets. We introduce a principled 'divide and conquer' inference procedure using variational inference with local merge and delete moves within batches of the data in parallel, followed by 'global' merge moves across batches to find global clustering structures. We show that these merge moves require only summaries of the data in each batch, enabling federated learning across local nodes without requiring the full dataset to be shared. Empirical results on simulated and benchmark datasets demonstrate that our method performs well in comparison to existing clustering algorithms. We validate the practical utility of the method by applying it to large-scale electronic health record (EHR) data.
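
To illustrate why a merge across batches can be scored from summaries alone, the following is a minimal sketch and not the FedMerDel algorithm itself. It assumes binary features, hard cluster assignments, and a Beta(a, b) prior on each feature probability, and it scores a merge with a collapsed Beta-Bernoulli marginal likelihood as a simplified stand-in for the ELBO comparison described above. All function names and the toy data are hypothetical. The point it shows is that each node only needs to share a cluster size and per-feature counts of ones, never the rows themselves.

```python
from math import lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def summarize(rows):
    """Local step: reduce a cluster of binary rows to (size, per-feature counts of ones)."""
    n = len(rows)
    ones = [sum(col) for col in zip(*rows)]
    return n, ones


def merge(s1, s2):
    """Summaries are additive, so merging needs no access to raw rows."""
    return s1[0] + s2[0], [x + y for x, y in zip(s1[1], s2[1])]


def log_marginal(summary, a=1.0, b=1.0):
    """Collapsed Beta-Bernoulli log marginal likelihood of one cluster (assumed model)."""
    n, ones = summary
    return sum(log_beta(a + s, b + n - s) - log_beta(a, b) for s in ones)


def merge_gain(s1, s2, a=1.0, b=1.0):
    """Positive when treating the two clusters as one is better supported than keeping them apart."""
    merged = log_marginal(merge(s1, s2), a, b)
    return merged - log_marginal(s1, a, b) - log_marginal(s2, a, b)


# Toy data: two nodes, each holding clusters found locally (hypothetical values).
node1_cluster = [[1, 1, 0, 0]] * 4
node2_similar = [[1, 1, 0, 0]] * 4
node2_different = [[0, 0, 1, 1]] * 4

# Only these summaries would leave each node.
s_a = summarize(node1_cluster)
s_b = summarize(node2_similar)
s_c = summarize(node2_different)

gain_similar = merge_gain(s_a, s_b)
gain_different = merge_gain(s_a, s_c)
print(gain_similar > 0, gain_different < 0)
```

In this toy case the score favours merging the two clusters with matching feature profiles and rejects merging the two with opposite profiles. A variational version would replace the hard counts with responsibility-weighted sums, which remain additive across batches.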
