A Bayesian approach to differential prevalence analysis with applications in microbiome studies

Juho Pelto; Kari Auranen; Janne V. Kujala; Leo Lahti

TL;DR

This study addresses differential prevalence analysis (DPA) for microbiome presence/absence data, highlighting boundary-case and multiplicity challenges in traditional methods. It introduces DiPPER, a Bayesian hierarchical model that borrows information across features using a shared asymmetric Laplace prior on the log-odds differences $\beta_j$, with covariates and sequencing depth accounted for in a logistic regression framework. Posterior inference is obtained via No-U-Turn Sampling in Stan, yielding multiplicity-adjusted uncertainty intervals and finite estimates even in boundary cases. On 80 original datasets from 67 gut microbiome studies, DiPPER shows high sensitivity and strong cross-study replication relative to frequentist DPA and DAA methods, while providing interpretable differential prevalence estimates and scalable uncertainty; robustness to hyperpriors and potential extensions to differential abundance analysis are discussed. Practical implications include more reliable detection of disease-associated presence/absence signals and reduced reliance on p-value corrections, with potential applicability to other omics domains.
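To make the shape of an asymmetric Laplace prior concrete, the following is a minimal sketch of its log density on a single log-odds difference $\beta_j$. The parameterization is an assumption made for illustration only (scale `tau`, real-valued skewness `nu`, with `kappa = exp(-nu)`); the paper's exact parameterization is not given in this summary and may differ. The function names are hypothetical.

```python
import math


def asym_laplace_logpdf(beta, tau, nu):
    """Log density of a zero-centred asymmetric Laplace distribution.

    Assumed parameterization (illustrative, not taken from the paper):
    kappa = exp(-nu); the right tail decays at rate kappa / tau and the
    left tail at rate 1 / (kappa * tau). nu = 0 gives a symmetric Laplace.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    kappa = math.exp(-nu)
    rate = kappa / tau if beta >= 0 else 1.0 / (kappa * tau)
    # Normalizing constant is 1 / (tau * (kappa + 1 / kappa)).
    return -math.log(tau) - math.log(kappa + 1.0 / kappa) - rate * abs(beta)


def prob_positive(nu):
    """Prior mass on beta > 0 under the assumed parameterization."""
    return 1.0 / (1.0 + math.exp(-2.0 * nu))
```

Under this assumed form, a positive `nu` shifts prior mass toward positive effects, and `tau` widens or narrows both tails together, which is the sense in which a shared prior can borrow information about the overall direction and size of effects across features.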

Abstract

Recent evidence suggests that analyzing the presence/absence of taxonomic features can offer a compelling alternative to differential abundance analysis in microbiome studies. However, standard approaches face challenges with boundary cases and multiple testing. To address these challenges, we developed DiPPER (Differential Prevalence via Probabilistic Estimation in R), a method based on Bayesian hierarchical modeling. We benchmarked our method against existing differential prevalence and abundance methods using data from 67 publicly available human gut microbiome studies. We observed considerable variation in performance across methods, with DiPPER outperforming alternatives by combining high sensitivity with effective error control. DiPPER also demonstrated superior replication of findings across independent studies. Furthermore, DiPPER provides differential prevalence estimates and uncertainty intervals that are inherently adjusted for multiple testing.
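The boundary-case problem can be seen with a plain sample odds ratio. The sketch below is not DiPPER and not any specific method from the benchmark; it only illustrates, under the assumption of a simple two-group 2x2 table, why a feature that is present in all or none of the subjects of one group has no finite maximum likelihood log odds ratio. The function name and arguments are hypothetical.

```python
import math


def log_odds_ratio(case_present, case_total, ctrl_present, ctrl_total):
    """Sample log odds ratio (case vs. control) from presence/absence counts.

    Returns +inf or -inf in boundary cases (0% or 100% prevalence in one
    group) and nan when the ratio is undefined (0/0).
    """
    a = case_present                 # cases with the feature present
    b = case_total - case_present    # cases with the feature absent
    c = ctrl_present                 # controls with the feature present
    d = ctrl_total - ctrl_present    # controls with the feature absent
    num, den = a * d, b * c
    if num == 0 and den == 0:
        return math.nan
    if den == 0:
        return math.inf
    if num == 0:
        return -math.inf
    return math.log(num / den)


# Interior case: finite estimate.
interior = log_odds_ratio(3, 5, 2, 5)
# Boundary case: feature present in every case subject.
boundary = log_odds_ratio(5, 5, 2, 5)
```

Here `interior` is finite while `boundary` is infinite, so Wald-type standard errors and intervals built around it are not available. A proper prior on the effect keeps the posterior estimate finite in the same situation.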

Paper Structure (34 sections, 5 equations, 11 figures, 4 tables)

Introduction
Methods
Schematic illustration of differential prevalence analysis
DiPPER -- A Bayesian hierarchical model for differential prevalence analysis
Posterior sampling
Performance evaluation and benchmarking
Original datasets
Definition of statistical significance
Performance metrics
Null data error rate
Number of significant findings
Cross-study replicability
Compared methods
Results
Illustrative examples
...and 19 more sections

Figures (11)

Figure 1: Schematic illustration of differential prevalence analysis. a) A presence/absence matrix for five taxonomic features (e.g., species or genera) across five control and five case subjects (samples). b) Results of DPA, i.e., the estimated differential prevalence effects with uncertainty intervals for the five features shown in a). The question marks indicate boundary cases (features C and E) where the prevalence is either 0% or 100% in one of the two groups. In such scenarios, some frequentist methods fail to yield finite point estimates, confidence intervals or p-values. OR = Odds ratio
Figure 2: Structure of DiPPER. a) Directed acyclic graph of the model hierarchy. The hyperparameters $\tau_{0}$ and $\nu_{0}$ determine the scale (width) and skewness of the prior for the differential prevalence parameters $\beta_{1}, \dots, \beta_{K}$. The nuisance parameters refer to intercepts ($\alpha_{\cdot}$) and regression coefficients for covariates ($\beta_{\cdot}^{\,\cdots}$), while $\mathbf{y}_{1}, \dots, \mathbf{y}_{K}$ indicate the observed presence/absence data vectors. b) The half-normal prior for the global scale $\tau_{0}$. c) The Laplace prior for the skewness parameter $\nu_{0}$. d) The asymmetric Laplace prior for each parameter $\beta_{j}$ under four illustrative combinations of $\tau_{0}$ and $\nu_{0}$. The index $j = 1,\dots, K$ indicates features.
Figure 3: Illustration of DiPPER performance and comparison with frequentist logistic regression (Wald). a) DPA results for 25 species (the first 25 in alphabetical order) in a null dataset where "case" and "control" groups (N = 31 and 30) were randomly assigned among healthy subjects in a gut microbiome study (Zeller2014PotentialCancer). b) Results for 25 species in a dataset comparing healthy subjects (N = 30) and subjects with CRC (N = 30) (Gupta2019AssociationIndia). In both panels, the points indicate the median posterior (DiPPER) or maximum likelihood (frequentist) differential prevalence estimates. The error bars represent $90\%$ credible intervals (left), unadjusted $90\%$ CIs (middle), or Bonferroni-adjusted $90\%$ CIs based on the Wald approximation (right). N/A indicates a non-finite result.
Figure 4: Performance of DiPPER and competing frequentist DPA and DAA methods on 480 null datasets and on 80 original datasets. a) x-axis: Null data error rate ($\lambda$), defined as the proportion of the 480 null datasets in which any significant findings were made. Ideally, this proportion should be at or below the significance level $\alpha = 0.10$ (vertical dashed line). The error bars indicate $90\%$ confidence intervals for the error rate estimates. y-axis: Median number of significant findings across the 80 original datasets. b) The number of significant findings in each of the 80 original datasets. Note the logarithmic scale. The methods are ordered by the median number of significant findings.
Figure 5: Replication of DPA and DAA results across studies. Replication was evaluated in 110 pairs of datasets, with each pair consisting of datasets from studies examining the same disease and utilizing the same sequencing methods (either 16S or shotgun). a) The definition of replicated and conflicting results. b) The number of replicated results across the study pairs (significance level $\alpha = 0.10$). The methods are ordered by the 75% quantile of the number of replicated findings. c) The total number of replicated results plotted against the total number of conflicting results at varying significance levels.
...and 6 more figures
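The null data error rate from the Figure 4 caption (the proportion of null datasets with at least one significant finding) can be sketched as follows. The input format, the function names, and the use of a Wilson score interval are assumptions made for illustration; the captions do not say how the $90\%$ confidence intervals were computed.

```python
import math
from statistics import NormalDist


def null_data_error_rate(n_significant_per_dataset):
    """Proportion of null datasets in which any significant finding was made."""
    counts = list(n_significant_per_dataset)
    if not counts:
        raise ValueError("need at least one null dataset")
    return sum(1 for c in counts if c > 0) / len(counts)


def wilson_interval(p_hat, n, level=0.90):
    """Wilson score interval for a proportion (an assumed choice of interval)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical input: 480 null datasets, 48 of which yield some finding.
counts = [2] * 48 + [0] * 432
rate = null_data_error_rate(counts)
low, high = wilson_interval(rate, len(counts))
```

A method is behaving as intended when `rate` is at or below the significance level $\alpha = 0.10$, since every finding in a null dataset is by construction a false positive.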
