Fused Lasso Improves Accuracy of Co-occurrence Network Inference in Grouped Samples

Daniel Agyapong, Briana H. Beatty, Peter G. Kennedy, Jane C. Marks, Toby D. Hocking

TL;DR

This work tackles the challenge of inferring microbiome co-occurrence networks across heterogeneous environments by introducing the Same-All Cross-validation (SAC) framework, which separately evaluates within-habitat and cross-habitat generalization. It adapts the fused-lasso-based fuser algorithm to microbiome data, decomposing edge weights into a global component and habitat-specific deviations, and optimizes a joint objective with sparsity and fusion penalties across habitats. Across six public grouped-sample datasets, fuser performs comparably to glmnet in homogeneous settings and substantially improves predictive accuracy in cross-environment scenarios, with taxon-wise analyses revealing complementary strengths across methods. The study provides a principled toolbox for cross-environment network inference, offering improved robustness to environmental heterogeneity and more nuanced ecological insights into microbial interactions across space and time.
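
As a rough illustration of the "joint objective with sparsity and fusion penalties" mentioned above, a generic fused-lasso regression for one target taxon across $K$ habitats might be written as follows. The symbols are assumptions introduced here for illustration and are not taken from the paper: $y^{(k)}$ is the target taxon's abundance in habitat $k$, $X^{(k)}$ holds the abundances of the other taxa, $n_k$ is the number of samples, $\beta^{(k)}$ are the habitat-specific edge weights, and $\lambda$ and $\gamma$ are the sparsity and fusion strengths. The paper's exact normalization, weighting, and choice of fusion norm may differ.

$$
\min_{\beta^{(1)},\dots,\beta^{(K)}} \;
\sum_{k=1}^{K} \frac{1}{n_k} \left\lVert y^{(k)} - X^{(k)} \beta^{(k)} \right\rVert_2^2
\;+\; \lambda \sum_{k=1}^{K} \left\lVert \beta^{(k)} \right\rVert_1
\;+\; \gamma \sum_{k < k'} \left\lVert \beta^{(k)} - \beta^{(k')} \right\rVert_2^2
$$

In this form, $\gamma = 0$ reduces to fitting an independent lasso per habitat, while a very large $\gamma$ pushes all $\beta^{(k)}$ toward a single shared vector, which corresponds to one network for the pooled data.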

Abstract

Co-occurrence network inference algorithms have significantly advanced our understanding of microbiome communities. However, these algorithms typically analyze microbial associations within samples collected from a single environmental niche, often capturing only static snapshots rather than dynamic microbial processes. Previous studies have commonly grouped samples from different environmental niches together without fully considering how microbial communities adapt their associations when faced with varying ecological conditions. Our study addresses this limitation by explicitly investigating both spatial and temporal dynamics of microbial communities. We analyzed publicly available microbiome abundance data across multiple locations and time points to evaluate algorithm performance in predicting microbial associations using our proposed Same-All Cross-validation (SAC) framework. SAC evaluates algorithms in two distinct scenarios: training and testing within the same environmental niche (Same), and training and testing on combined data from multiple environmental niches (All). To overcome the limitations of conventional algorithms, we propose fuser, an algorithm that, while not entirely new in machine learning, is novel for microbiome community network inference. It retains subsample-specific signals while simultaneously sharing relevant information across environments during training. Unlike standard approaches that infer a single generalized network from combined data, fuser generates distinct, environment-specific predictive networks. Our results demonstrate that fuser achieves comparable predictive performance to existing algorithms such as glmnet when evaluated within homogeneous environments (Same), and notably reduces test error compared to baseline algorithms in cross-environment (All) scenarios.
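
The Same and All scenarios described above can be sketched as a split generator. This is a minimal illustration, not the authors' implementation: the function name `sac_splits`, its parameters, and the choice to score both scenarios on the same held-out fold of a single niche are assumptions made here so that the two training sets can be compared on identical test samples.

```python
import numpy as np


def sac_splits(groups, n_folds=3, seed=0):
    """Yield Same/All train-test index splits (hypothetical helper).

    groups : one habitat label per sample.
    Assumption: the test set is always the held-out fold of one habitat;
    "same" trains on that habitat only, "all" trains on every habitat.
    """
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)

    # Assign fold ids within each habitat so every fold contains every habitat.
    fold_ids = np.empty(len(groups), dtype=int)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        fold_ids[idx] = rng.permutation(len(idx)) % n_folds

    for fold in range(n_folds):
        in_train = fold_ids != fold
        for g in np.unique(groups):
            in_group = groups == g
            test = np.flatnonzero(~in_train & in_group)
            train_sets = {
                "same": np.flatnonzero(in_train & in_group),
                "all": np.flatnonzero(in_train),
            }
            for scenario, train in train_sets.items():
                yield {
                    "fold": fold,
                    "test_group": g,
                    "scenario": scenario,
                    "train": train,
                    "test": test,
                }


if __name__ == "__main__":
    labels = ["soil"] * 6 + ["gut"] * 6  # made-up habitat labels
    for split in sac_splits(labels, n_folds=3):
        print(split["fold"], split["test_group"], split["scenario"],
              len(split["train"]), len(split["test"]))
```

Each yielded split can then be passed to any regression learner: fit on `train`, predict on `test`, and record the test MSE per scenario.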

Paper Structure

This paper contains 16 sections, 6 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Same-All Cross-validation (SAC) for microbiome network inference across habitats.
  • Figure 2: Conceptual diagram of regularization for microbiome network inference across habitats.
  • Figure 3: Performance comparison between fuser using "all" available subsets and cv_glmnet using "same" subsets only. The y-axis shows mean MSE difference with 95% confidence intervals between fuser(all) and cv_glmnet(same). Points below zero indicate fuser performs better when combining subsets compared to cv_glmnet trained on individual subsets, while points above zero indicate worse performance. The x-axis shows $\log_{10}$(p-value with 95% CI), with the vertical dashed line at $p = 0.05$ representing the significance threshold. All datasets show improved performance when fuser combines environmental subsets.
  • Figure 4: Performance comparison between fuser and cv_glmnet when both methods utilize all available subsets. The y-axis shows mean MSE difference with 95% confidence intervals between fuser(all) and cv_glmnet(all). Negative values (below zero) indicate fuser performs better, while positive values indicate cv_glmnet performs better. The x-axis shows $\log_{10}$(p-value with 95% CI), with the vertical dashed line at $p = 0.05$ representing the significance threshold.
  • Figure 5: Taxa-specific performance comparison across regularization methods in MovingPictures dataset. Each panel shows mean regression MSE with 95% confidence intervals for a representative taxon, comparing proposed_fuser(all), cv_glmnet(all), cv_glmnet(same), and featureless baseline algorithms. Different taxa exhibit distinct optimal methods: Taxa4383166 favors subset-specific modeling, Taxa4450795 benefits from fusion regularization, and Taxa4467447 performs best with standard regularization on combined data. P-values indicate statistical significance of performance differences.
  • ...and 1 more figures
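
The quantity plotted in Figures 3 and 4, a mean MSE difference with a 95% confidence interval, can be illustrated with a small sketch. The figure captions do not say how the intervals or p-values were computed, so the percentile bootstrap used here is an assumption, and the function name `mean_diff_ci` and the example numbers are made up for illustration.

```python
import numpy as np


def mean_diff_ci(mse_a, mse_b, n_boot=2000, seed=0):
    """Mean paired MSE difference (a - b) with a 95% percentile-bootstrap CI.

    Assumption: mse_a and mse_b are test MSEs of two methods on the same
    cross-validation folds, so the differences are paired.
    """
    diff = np.asarray(mse_a, dtype=float) - np.asarray(mse_b, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the paired differences with replacement, n_boot times.
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    return diff.mean(), lower, upper


if __name__ == "__main__":
    # Made-up per-fold test MSEs, not results from the paper.
    mse_fuser_all = [0.80, 0.90, 0.70, 0.85, 0.75]
    mse_glmnet_same = [1.00, 1.10, 0.95, 1.00, 0.90]
    mean, lower, upper = mean_diff_ci(mse_fuser_all, mse_glmnet_same)
    print(f"mean difference {mean:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Under the convention in the captions, a negative mean difference whose interval stays below zero would indicate that the first method has lower test error.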