Hierarchical Sparse Bayesian Multitask Model with Scalable Inference for Microbiome Analysis

Haonan Zhu, Andre R. Goncalves, Camilo Valdes, Hiranmayi Ranganathan, Boya Zhang, Jose Manuel Martí, Car Reen Kok, Monica K. Borucki, Nisha J. Mulakken, James B. Thissen, Crystal Jaing, Alfred Hero, Nicholas A. Be

TL;DR

The paper tackles binary health-state prediction from high-dimensional microbiome data pooled across multiple studies by introducing a hierarchical Bayesian multitask logistic regression with a shared sparsity prior. It derives scalable variational inference using a mean-field approximation and coordinate ascent updates to approximate the intractable posterior, incorporating a Bernoulli-Gaussian sparsity pattern $z_j\sim\mathrm{Bernoulli}(\theta)$, $\theta\sim\mathrm{Beta}(\alpha_0,\beta_0)$, and task-wide weight covariance $\boldsymbol\Sigma_0$ with $\boldsymbol\Sigma_0^{-1}\sim\mathrm{Wishart}(v_0,\boldsymbol V_0)$. Through synthetic and real microbiome experiments, the approach achieves strong support recovery under shared sparsity and provides well-calibrated predictions with uncertainty quantification, even amid heterogeneous pooled data. The results highlight robustness to cross-study heterogeneity and offer interpretable insights by identifying informative microbial taxa across diseases, with potential for improved biomarker discovery and clinical decision support.
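The prior hierarchy named above can be made concrete with a small generative sketch. It is not the authors' implementation and it rests on several assumptions that the summary does not confirm: $\boldsymbol\Sigma_0$ is taken to be a covariance across tasks for each feature, the Wishart scale $\boldsymbol V_0$ is set to the identity, $v_0$ is an integer, the indicator $z_j$ zeroes feature $j$ in every task, and the likelihood is a plain logistic model. The function names (`cholesky`, `solve_lower_transpose`, `sample_model`) are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a small symmetric positive-definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower_transpose(L, g):
    """Solve L^T w = g by back-substitution (L^T is upper triangular)."""
    n = len(g)
    w = [0.0] * n
    for i in reversed(range(n)):
        s = g[i] - sum(L[k][i] * w[k] for k in range(i + 1, n))
        w[i] = s / L[i][i]
    return w


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def sample_model(n_tasks=3, n_features=6, n_samples=4,
                 alpha0=1.0, beta0=1.0, v0=5, seed=0):
    """Draw (theta, z, weights, data) from the assumed hierarchy."""
    if v0 < n_tasks:
        raise ValueError("v0 must be at least n_tasks for a full-rank draw")
    rng = random.Random(seed)

    # theta ~ Beta(alpha0, beta0); z_j ~ Bernoulli(theta), shared by all tasks
    theta = rng.betavariate(alpha0, beta0)
    z = [1 if rng.random() < theta else 0 for _ in range(n_features)]

    # Sigma_0^{-1} ~ Wishart(v0, I): sum of v0 outer products of N(0, I) vectors
    # (ASSUMPTION: V_0 = I and integer v0)
    prec = [[0.0] * n_tasks for _ in range(n_tasks)]
    for _ in range(v0):
        g = [rng.gauss(0.0, 1.0) for _ in range(n_tasks)]
        for a in range(n_tasks):
            for b in range(n_tasks):
                prec[a][b] += g[a] * g[b]
    L = cholesky(prec)

    # If prec = L L^T, then w = L^{-T} g has covariance prec^{-1} = Sigma_0.
    # weights[j] holds feature j's coefficients across tasks; zero when z_j = 0.
    weights = []
    for j in range(n_features):
        if z[j]:
            g = [rng.gauss(0.0, 1.0) for _ in range(n_tasks)]
            weights.append(solve_lower_transpose(L, g))
        else:
            weights.append([0.0] * n_tasks)

    # ASSUMPTION: Gaussian features and a logistic likelihood per task
    data = []
    for t in range(n_tasks):
        X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
             for _ in range(n_samples)]
        y = []
        for x in X:
            logit = sum(x[j] * weights[j][t] for j in range(n_features))
            y.append(1 if rng.random() < sigmoid(logit) else 0)
        data.append((X, y))
    return theta, z, weights, data


theta, z, weights, data = sample_model()
print("theta:", theta)
print("shared support z:", z)
```

Because `z` is drawn once and reused for every task, all tasks share the same set of active features, which is the shared-sparsity structure the summary describes. Only the nonzero coefficient values differ between tasks.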

Abstract

This paper proposes a hierarchical Bayesian multitask learning model that is applicable to the general multi-task binary classification learning problem where the model assumes a shared sparsity structure across different tasks. We derive a computationally efficient inference algorithm based on variational inference to approximate the posterior distribution. We demonstrate the potential of the new approach on various synthetic datasets and for predicting human health status based on microbiome profile. Our analysis incorporates data pooled from multiple microbiome studies, along with a comprehensive comparison with other benchmark methods. Results in synthetic datasets show that the proposed approach has superior support recovery property when the underlying regression coefficients share a common sparsity structure across different tasks. Our experiments on microbiome classification demonstrate the utility of the method in extracting informative taxa while providing well-calibrated predictions with uncertainty quantification and achieving competitive performance in terms of prediction metrics. Notably, despite the heterogeneity of the pooled datasets (e.g., different experimental objectives, laboratory setups, sequencing equipment, patient demographics), our method delivers robust results.

Paper Structure

This paper contains 13 sections, 8 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Graphical visualization of the proposed probabilistic model. For the base model in (a), information sharing across different tasks is enforced by common priors on the regression weights and sparsity parameters. For the clustering extension in (b), information sharing is limited to tasks from the same cluster (i.e., disease category).
  • Figure 2: Summary of the support recovery results for the simulated data, evaluated using balanced accuracy across $10$ different runs. The proposed Bayesian approach (BayesMTL) outperforms the benchmark methods in terms of balanced accuracy, especially when there is a shared sparsity structure across the regression coefficients of different tasks. Both MSSL and MTFL prioritize prediction performance in the cross-validation step, which results in completely dense solutions (i.e., all regression coefficients are non-zero); hence they achieve identical accuracy.
  • Figure 3: Summary of the prediction performance evaluated by balanced accuracy across 7 taxonomic ranks. Due to the heterogeneous nature of the data, we do not see an improvement of the proposed approach over single-task models. However, the proposed approach is the only multitask method that provides a sparse solution, i.e., it identifies common microbes across studies of the same disease category that are informative for prediction, along with uncertainty quantification through the posterior distribution.
  • Figure 4: Calibration analysis for the proposed model on the Order taxonomic rank. Fig. (a) and Fig. (b) show the histograms of the predicted probabilities on the training and test data, respectively. Due to the choice of the logit as link function, the predicted probabilities are concentrated around the boundaries. Fig. (c) shows the calibration curves of the predictions on the training and test data. The model achieves near-perfect calibration on the training data, and the degradation of performance on the test data at the boundary values indicates that the logit link function results in over-confident predictions.
  • Figure 5: Feature sparsity visualization across $19$ different disease categories at the Order taxonomic rank. The $x$-axis corresponds to different samples drawn from the posterior distribution and the $y$-axis corresponds to different taxIDs. The gradation from white to black corresponds to increasing importance weight, and the darker shaded horizontal lines represent the sparse features selected by the algorithm. For diabetes and diarrhea, few taxIDs are considered informative for the health prediction task by the model, while for cardiovascular disease the majority of taxIDs are considered informative.
  • ...and 1 more figure
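The captions above rely on two evaluation quantities: balanced accuracy (Figures 2 and 3) and a calibration curve (Figure 4). The sketch below shows one common way to compute each. It is illustrative only: the equal-width binning, the function names, and the toy inputs are assumptions, not the paper's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def calibration_curve(y_true, probs, n_bins=5):
    """Per non-empty equal-width bin: (mean predicted prob, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((t, p))
    curve = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for _, p in members) / n
            observed = sum(t for t, _ in members) / n
            curve.append((mean_pred, observed, n))
    return curve


# Support recovery: 1 marks a nonzero coefficient (toy values, hypothetical).
true_support = [1, 1, 0, 0, 0, 0]
dense_estimate = [1, 1, 1, 1, 1, 1]    # every coefficient reported as nonzero
sparse_estimate = [1, 1, 0, 0, 1, 0]   # one false positive

print(balanced_accuracy(true_support, dense_estimate))   # 0.5
print(balanced_accuracy(true_support, sparse_estimate))  # 0.875

# Calibration: well-calibrated bins have mean_pred close to observed.
labels = [0, 0, 1, 1]
probs = [0.05, 0.10, 0.90, 0.95]
for mean_pred, observed, count in calibration_curve(labels, probs):
    print(mean_pred, observed, count)
```

A fully dense estimate has sensitivity 1 and specificity 0, so its balanced accuracy for support recovery is 0.5 whenever the true support contains both zero and nonzero entries. This is consistent with the Figure 2 caption's remark that the two dense methods score identically.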