Multinomial belief networks for healthcare data
H. C. Donker, D. Neijzen, J. de Jong, G. A. Lunter
TL;DR
This work introduces a multinomial belief network (MBN), a deep Bayesian model for healthcare data that handles sparse, high-dimensional, and incomplete count data while providing uncertainty quantification. Building on the Poisson gamma belief network (PGBN), the MBN uses multinomial observables and Dirichlet activations, enabling a deep, interpretable representation with layer-wise augmentation via Dirichlet–multinomial–CRT factorization. The authors develop a collapsed Gibbs sampler that propagates information up and down the network, obtaining posterior updates for latent weights, activations, and dispersion via conjugate relationships and augmentation identities. Demonstrations on handwritten digits and a large cancer mutational dataset show that the MBN discovers coherent hierarchical structures and biologically meaningful mutational signatures in a data-driven way, with better held-out perplexity than nonnegative matrix factorization baselines. The approach enables principled deconvolution of heterogeneous healthcare data and provides uncertainty estimates essential for clinical decision-making, though scaling to very large datasets remains a challenge to be addressed with future approximate or hybrid inference methods.
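The CRT (Chinese restaurant table) distribution named in the augmentation above can be illustrated with a short sampler. This is a minimal sketch of the standard construction, in which a draw from CRT(n, r) is a sum of n independent Bernoulli variables with success probabilities r / (r + i) for i = 0, ..., n - 1. The function name `sample_crt` is a hypothetical choice for illustration and is not taken from the authors' code.

```python
import numpy as np

def sample_crt(n, r, rng):
    """Draw l ~ CRT(n, r) for a count n >= 0 and concentration r > 0.

    Interpretation: the number of occupied tables after seating n customers
    in a Chinese restaurant process with concentration r.
    """
    if n == 0:
        return 0
    i = np.arange(n)               # customer indices 0, ..., n-1
    p = r / (r + i)                # probability that customer i opens a new table
    return int(rng.binomial(1, p).sum())

rng = np.random.default_rng(0)
draw = sample_crt(20, 1.5, rng)    # an integer between 1 and 20
```

The first customer always opens a table (probability r / r = 1), so any draw with n > 0 is at least 1 and at most n.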
Abstract
Healthcare data from patient or population cohorts are often characterized by sparsity, high missingness and relatively small sample sizes. In addition, being able to quantify uncertainty is often important in a medical context. To address these analytical requirements we propose a deep generative Bayesian model for multinomial count data. We develop a collapsed Gibbs sampling procedure that takes advantage of a series of augmentation relations, inspired by the Zhou–Cong–Chen model. We visualise the model's ability to identify coherent substructures in the data using a dataset of handwritten digits. We then apply it to a large experimental dataset of DNA mutations in cancer and show that we can identify biologically meaningful clusters of mutational signatures in a fully data-driven way.
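To make the idea of a deep generative model for multinomial counts concrete, the sketch below forward-samples from a two-layer network with Dirichlet activations and multinomial observations. It is an illustration under stated assumptions, not the authors' implementation: the layer coupling (each layer's activation drawn from a Dirichlet whose concentration is a scaled mixture of the layer above), the scalar dispersion `c`, the top-level concentration `r`, and the function name `sample_two_layer_mbn` are all assumed here.

```python
import numpy as np

def sample_two_layer_mbn(n_samples, n_counts, phi1, phi2, r, c, rng):
    """Forward-sample count vectors from an assumed two-layer network.

    phi1: (V, K1) array, each column a probability vector over V categories
    phi2: (K1, K2) array, each column a probability vector over K1 topics
    r:    (K2,) positive top-level Dirichlet concentration (assumed)
    c:    positive scalar dispersion linking the layers (assumed)
    """
    V = phi1.shape[0]
    X = np.empty((n_samples, V), dtype=np.int64)
    for j in range(n_samples):
        theta2 = rng.dirichlet(r)                   # top-layer activation
        theta1 = rng.dirichlet(c * (phi2 @ theta2)) # first-layer activation
        p = phi1 @ theta1                           # category probabilities
        p = p / p.sum()                             # guard against float drift
        X[j] = rng.multinomial(n_counts, p)         # observed counts
    return X

rng = np.random.default_rng(0)
V, K1, K2 = 12, 4, 2
phi1 = rng.dirichlet(np.ones(V), size=K1).T         # shape (V, K1)
phi2 = rng.dirichlet(np.ones(K1), size=K2).T        # shape (K1, K2)
X = sample_two_layer_mbn(5, 50, phi1, phi2, np.ones(K2), 10.0, rng)
```

Each row of `X` sums to `n_counts`, since every sample is a single multinomial draw. Inference, which the abstract describes as collapsed Gibbs sampling, runs in the opposite direction and is not shown here.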
