VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data

Paul D. W. Kirk, Jackie Rao

TL;DR

VICatMix introduces a variational Bayesian finite mixture model tailored for high-dimensional discrete biomedical data, integrating automatic variable selection and model averaging to robustly infer cluster structure when the true number of clusters is unknown. The method uses an overfitted $K$-component mixture with sparse Dirichlet priors and a mean-field VI framework, augmented by a co-clustering matrix to stabilise results across multiple initialisations. Variable selection enhances performance in noisy, high-dimensional settings, with thresholds applied over multiple runs to produce a concise feature set. Applications to simulated data and real-world data (yeast galactose data with GO categories, AML mutations, and pan-cancer COCA), including datasets from TCGA, demonstrate accurate clustering, biologically meaningful gene selection, and the ability to uncover interpretable cancer subtypes. The approach is implemented in an efficient R package with C++ acceleration for scalability.
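To make the co-clustering matrix concrete, the following is a minimal sketch, assuming each VI run returns one hard cluster label per observation. The function name `co_clustering_matrix` and the toy labels are hypothetical and are not taken from the VICatMix package; the sketch only shows how agreement across runs can be pooled in a way that does not depend on how each run names its clusters.

```python
import numpy as np

def co_clustering_matrix(label_runs):
    """Fraction of runs in which each pair of observations shares a cluster.

    label_runs: integer array of shape (n_runs, n_obs); row r holds the
    hard cluster labels from run r (labels need not match across runs).
    """
    label_runs = np.asarray(label_runs)
    n_runs, n_obs = label_runs.shape
    ccm = np.zeros((n_obs, n_obs))
    for labels in label_runs:
        # Pairwise indicator: True where observations i and j share a label.
        ccm += labels[:, None] == labels[None, :]
    return ccm / n_runs

# Three hypothetical runs on five observations; label names differ per run.
runs = np.array([
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [2, 2, 2, 0, 1],
])
ccm = co_clustering_matrix(runs)
```

Here observations 0 and 1 are clustered together in every run, so their entry is 1, while observations 2 and 3 agree in two of the three runs. A single summary clustering would then be derived from this matrix; that summarisation step is not shown here.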

Abstract

Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growth in availability of high-dimensional categorical data, including 'omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of efficiency, while maintaining high accuracy. VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarisation and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas (TCGA), showing its use in cancer subtyping and driver gene discovery. We also demonstrate VICatMix's utility in integrative cluster analysis with different 'omics datasets, enabling the discovery of novel subtypes. Availability: VICatMix is freely available as an R package, incorporating C++ for faster computation, at https://github.com/j-ackierao/VICatMix.
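As a rough illustration of combining variable selection across runs, the sketch below assumes each run yields a binary indicator per feature (1 if the feature was retained) and keeps features retained in at least a chosen fraction of runs. The function name `select_features`, the parameter `min_fraction`, and its default value are assumptions made for illustration; they are not the package's API or the paper's stated threshold.

```python
import numpy as np

def select_features(selected_runs, min_fraction=0.95):
    """Keep features that were selected in at least min_fraction of runs.

    selected_runs: 0/1 array of shape (n_runs, n_features).
    Returns (indices of kept features, per-feature selection frequency).
    """
    selected_runs = np.asarray(selected_runs, dtype=float)
    # Proportion of runs in which each feature was selected.
    frequency = selected_runs.mean(axis=0)
    kept = np.flatnonzero(frequency >= min_fraction)
    return kept, frequency

# Four hypothetical runs over five features.
indicators = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0],
])
strict, freq = select_features(indicators)           # default threshold
lenient, _ = select_features(indicators, 0.75)       # looser threshold
```

With the default threshold only features 0 and 3, which are selected in every run, survive; lowering the threshold to 0.75 also admits feature 1.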

Paper Structure

This paper contains 46 sections, 31 equations, 34 figures, 11 tables.

Figures (34)

  • Figure 1: Graphical representation of VICatMix. Here, ${\bf z}_n$ is a '1-of-K' latent variable associated with the data point ${\bf x}_n$ representing its cluster allocation; see the Supplementary Material for more details.
  • Figure 2: Boxplots for Simulation 2.1 comparing the ARI and number of clusters of each model-averaging method across all 10 simulated datasets, alongside the grand mean of the individual runs, for different numbers of clustering solutions in the co-clustering matrix.
  • Figure 3: Heatmap of the VICatMix-Avg clustering structure on the yeast galactose data compared with the GO labelling when K=10.
  • Figure 4: Heatmaps of the VICatMixVarSel-Avg clustering structure on the AML mutation dataset.
  • Figure 5: Dotplots visualising over-representation analysis (ORA) for 6 selected genes for the AML dataset using gene-disease annotations from the Disease Ontology (DO). p.adjust is the p-value from the hypergeometric test used in ORA, adjusted using the Benjamini-Hochberg procedure. The top 10 most significant annotations are shown.
  • ...and 29 more figures