Bayesian Covariate-Varying Interaction Analysis for Multivariate Count Data: Application to Microbiome Studies

Shuangjie Zhang, Michael L. Patnode, Juhee Lee

Abstract

Understanding covariate-varying interdependencies among features is of great interest in various applications. Motivated by microbiome studies where microbial abundances and interactions vary with environmental factors, we develop a Bayesian covariate-varying factor model. This model flexibly estimates heteroscedasticity in the covariance matrix as a function of covariates. Specifically, our approach employs covariance regression through linear regression on a lower-dimensional factor loading matrix. This formulation, combined with joint sparsity induced by the Dirichlet--Horseshoe prior for the factor loadings, provides robust estimation of covariate-varying covariance in high-dimensional settings. The model simultaneously incorporates a regression structure for the mean abundance and jointly addresses the covariate-varying mean and covariance structure. Furthermore, the model tackles key statistical challenges such as discreteness, over-dispersion, compositionality, and high dimensionality, common in microbiome data analysis, using a flexible nonparametric Bayesian framework. We thoroughly investigate the properties of the model and conduct extensive simulation studies to examine its performance. Real microbiome data examples are provided for illustration.
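
To make the idea of covariance regression through a factor loading matrix concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the parameterization, so the linear form $\Lambda(\bm{x}) = \sum_p x_p B_p$, the diagonal residual variance, the function name, and all dimensions below are assumptions chosen purely for illustration. The sketch also ignores the count likelihood, the mean regression, and the sparsity prior.

```python
import numpy as np

def covariate_varying_covariance(x, B, sigma2):
    """Covariance implied by a factor model whose loadings are linear in x.

    Hypothetical parameterization (an assumption, not taken from the paper):
      x      : (P,) covariate vector
      B      : (P, J, K) array; B[p] is the J x K loading contribution of covariate p
      sigma2 : (J,) residual variances
    """
    # Loading matrix as a linear function of the covariates: Lambda(x) = sum_p x[p] * B[p]
    Lam = np.tensordot(x, B, axes=1)          # shape (J, K)
    # Low-rank part plus diagonal noise gives a J x J covariance matrix
    return Lam @ Lam.T + np.diag(sigma2)

rng = np.random.default_rng(0)
P, J, K = 4, 6, 2                             # illustrative sizes only
B = rng.normal(size=(P, J, K))
sigma2 = np.full(J, 0.5)

# Two covariate conditions, borrowed from the Figure 1 caption for illustration
x_a = np.array([1.0, 0.0, 0.0, 1.0])
x_b = np.array([1.0, 1.0, 0.0, 0.0])

Sigma_a = covariate_varying_covariance(x_a, B, sigma2)
Sigma_b = covariate_varying_covariance(x_b, B, sigma2)
```

Under these assumptions, each condition yields its own symmetric positive definite covariance matrix, while the number of free parameters grows with $P \times J \times K$ rather than with $J^2$ per condition.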

Paper Structure (11 sections, 14 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: [Simulation 1] Panel (a) shows a histogram of differences between $\hat{\Sigma}_{jj^\prime}(\bm{x}_i)$ and $\Sigma^{{\hbox{\scriptsize tr}}}_{jj^\prime}(\bm{x}_i)$ under six levels, $j \leq j^\prime$. Panels (b) and (c) compare the true covariance $\Sigma^{\hbox{\scriptsize tr}}(\bm{x})$ (lower triangle) with its posterior median estimate $\hat{\Sigma}(\bm{x})$ (upper triangle) for two arbitrarily selected conditions, $\bm{x}=(1, 0, 0, 1)^\prime$ and $\bm{x}^\prime=(1, 1, 0, 0)^\prime$.
  • Figure 2: [Simulation 1] The posterior median estimate of mean abundance $\mu_{ij}$ is plotted against the truth in panel (a). Panels (b)-(d) plot posterior point estimates of $\beta_{j1}-\beta_{j2}$, $\beta_{j4} -\beta_{j3}$ and $\beta_{j5} -\beta_{j3}$ against their true values. The red dots represent posterior median estimates, and the grey vertical lines represent 95% posterior credible interval estimates. Panels (e) and (f) illustrate the covariance estimates of MOFA+ for two arbitrarily selected conditions, $\bm{x}=(1, 0, 0, 1)^\prime$ and $\bm{x}^\prime=(1, 1, 0, 0)^\prime$.
  • Figure 3: [Simulation 2] Panel (a) shows a histogram of differences between $\hat{\Sigma}_{jj^\prime}(\bm{x}_i)$ and $\Sigma^{{\hbox{\scriptsize tr}}}_{jj^\prime}(\bm{x}_i)$ across all samples. In panel (b), the lower left and upper right triangles of the heatmap show the true values $\Sigma^{{\hbox{\scriptsize tr}}}_{jj^\prime}$ and their posterior estimates of correlations $\hat{\Sigma}_{jj^\prime}$, respectively. Two samples from subject 2, samples 2 and 27, are arbitrarily chosen for illustration. Their covariates are $\bm{x}_{2}=(1,-1.23)$ and $\bm{x}_{27}=(0,-1.23)$.
  • Figure 4: [Simulation 2] Posterior estimates $\hat{\Sigma}_{jj^\prime}(\bm{x})$ (dashed line) and true values $\Sigma^{\hbox{\scriptsize tr}}_{jj^\prime}(\bm{x})$ (solid line) are plotted for three arbitrarily chosen OTU pairs: OTUs 67 and 86 in panel (a), OTUs 4 and 96 in panel (b), and OTUs 74 and 90 in panel (c). Crosses mark observed values of the continuous covariate $x_c$. Red and blue correspond to $x^d=0$ and 1, respectively. The shaded regions represent pointwise 95% posterior credible interval estimates.
  • Figure 5: [Simulation 2] The posterior estimates of the effects of the binary and continuous covariates on abundance are plotted in panels (a) and (b), respectively. The dots represent the posterior median estimates, while the vertical lines indicate their corresponding 95% credible interval estimates.
  • ...and 5 more figures