Distributed MCMC inference for Bayesian Non-Parametric Latent Block Model
Reda Khoufache, Anisse Belhadj, Hanene Azzag, Mustapha Lebbah
TL;DR
This work tackles scalable co-clustering when the numbers of row and column clusters are unknown. It introduces DisNPLBM, a distributed MCMC inference framework for the Bayesian Non-Parametric Latent Block Model. DisNPLBM adopts a Master/Worker architecture: rows are partitioned across workers, each worker updates its local row memberships and sends sufficient statistics to the master, and the master then estimates the global row and column partitions, enabling streaming, asynchronous fusion of results. The approach leverages collapsed Gibbs sampling with analytic predictive distributions under Normal-Inverse-Wishart (NIW) priors, achieving substantial speedups while preserving clustering accuracy, and it is applied to gene expression datasets. The implementation, publicly available at https://github.com/redakhoufache/Distributed-NPLBM, scales well and performs competitively against centralized BNP methods, with potential extensions to other co-clustering setups.
Abstract
In this paper, we introduce a novel Distributed Markov Chain Monte Carlo (MCMC) inference method for the Bayesian Non-Parametric Latent Block Model (DisNPLBM), based on a Master/Worker architecture. Our non-parametric co-clustering algorithm partitions observations and features using latent multivariate Gaussian block distributions. The rows are distributed evenly among the workers, which communicate only with the master and not with one another. Experimental results show the effect of DisNPLBM on cluster labeling accuracy and execution time. Moreover, we present a real-world use case in which we apply our approach to co-cluster gene expression data. The source code is publicly available at https://github.com/redakhoufache/Distributed-NPLBM.
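The Master/Worker data flow described above can be illustrated with a minimal sketch. It is not taken from the authors' repository: the names `SuffStats`, `split_rows`, `worker_summarize`, and `master_merge` are hypothetical, the Gibbs sampling steps are omitted entirely, and the mapping from local to global clusters is supplied by hand, whereas in the method it would be inferred by the master. The sketch only shows why exchanging Gaussian sufficient statistics (count, sum, and sum of outer products) is enough for the master to combine worker results: these statistics are additive.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


@dataclass
class SuffStats:
    """Gaussian sufficient statistics of one cluster (hypothetical helper)."""
    n: int                 # number of rows in the cluster
    total: Vector          # sum of the rows
    outer: List[Vector]    # sum of the outer products x x^T

    @classmethod
    def empty(cls, dim: int) -> "SuffStats":
        return cls(0, [0.0] * dim, [[0.0] * dim for _ in range(dim)])

    def add(self, x: Sequence[float]) -> None:
        """Absorb a single row."""
        self.n += 1
        for i, xi in enumerate(x):
            self.total[i] += xi
            for j, xj in enumerate(x):
                self.outer[i][j] += xi * xj

    def merge(self, other: "SuffStats") -> None:
        """Absorb another cluster's statistics; all three terms are additive."""
        self.n += other.n
        for i, row in enumerate(other.outer):
            self.total[i] += other.total[i]
            for j, value in enumerate(row):
                self.outer[i][j] += value


def split_rows(n_rows: int, n_workers: int) -> List[range]:
    """Partition row indices into contiguous shards whose sizes differ by at most one."""
    base, extra = divmod(n_rows, n_workers)
    shards, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


def worker_summarize(rows: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> Dict[int, SuffStats]:
    """Worker side: summarize local rows per local cluster label.

    Assumption: `labels` stands in for the local row memberships that a
    worker would obtain by sampling; no sampling is performed here.
    """
    dim = len(rows[0])
    stats: Dict[int, SuffStats] = {}
    for x, z in zip(rows, labels):
        stats.setdefault(z, SuffStats.empty(dim)).add(x)
    return stats


def master_merge(worker_stats: List[Dict[int, SuffStats]],
                 mapping: Dict[Tuple[int, int], int]) -> Dict[int, SuffStats]:
    """Master side: fuse local clusters into global ones.

    Assumption: `mapping` sends (worker index, local label) to a global
    label and is given by hand; the master would normally infer it.
    """
    merged: Dict[int, SuffStats] = {}
    for w, stats in enumerate(worker_stats):
        for local_label, st in stats.items():
            g = mapping[(w, local_label)]
            merged.setdefault(g, SuffStats.empty(len(st.total))).merge(st)
    return merged


# Toy run: four 2-dimensional rows, two workers, one local cluster each.
data = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
shards = split_rows(len(data), 2)
local = [worker_summarize([data[i] for i in shard], [0] * len(shard))
         for shard in shards]
merged = master_merge(local, {(0, 0): 0, (1, 0): 1})
print({g: (st.n, st.total) for g, st in merged.items()})
```

Because only per-cluster summaries leave a worker, the communication cost in this sketch depends on the number of local clusters and the dimension of the data, not on the number of rows held by each worker.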
