Learning functional groups in complex microbiomes

Matthew S Schmitt, Kiseok Lee, Freddy Bunbury, Joseph A Landsittel, Vincenzo Vitelli, Seppe Kuehn

TL;DR

This work introduces a data-driven approach that explains how community function can be traced to just a few groups of microbes or genes, and illustrates how to do function-informed dimensionality reduction in biology.

Abstract

From soil to the gut, communities composed of thousands of microbes perform functions such as carbon sequestration and immune system regulation. Here, we introduce a data-driven approach that explains how community function can be traced to just a few groups of microbes or genes. In gut communities, our neural-network-based clustering algorithm correctly recovers known functional groups. In the ocean metagenome, it distills ~500 gene modules down to three sparse groups highlighting survival strategies at different depths. In soils, it distills ~4400 bacterial species into two groups that enter a mathematical model of nitrate metabolism. By combining interpretable ML with strain isolation and sequencing experiments, we connect the metabolic specialization of each group to community-wide responses to perturbations. This integrated approach yields simple structure-function maps of microbiomes, allowing the discovery of molecular mechanisms underlying human and environmental health. More broadly, we illustrate how to do function-informed dimensionality reduction in biology.

Paper Structure (16 sections, 9 equations, 24 figures, 1 table)


Figures (24)

  • Figure 1: Data-driven discovery of functional groups and their dynamics (a) Microbial communities perform crucial environmental functions from the soil to the ocean to the gut. (b) In soils, microbes collectively reduce nitrate to dinitrogen gas in a process called denitrification. This process is composed of several discrete steps (black arrows), one of which produces the potent greenhouse gas nitrous oxide. (c) Our machine learning method, SCiFI, automatically finds the functional groups of bacteria (blue, green) that contribute to community function. These groups correspond to distinct metabolic functions (arrows). (d) By sequencing the genomes of group members, we can disentangle the biological mechanisms leading to collective function.
  • Figure 2: SCiFI: a neural network-based method to identify functional groups (a) Our pipeline consists of two steps: first, species abundances are aggregated via matrix multiplication with a grouping matrix; second, group abundances are used to predict the target function using a neural network. An optional gating term allows entire rows of the grouping matrix to be set to zero (Methods). During training, the grouping matrix is learned simultaneously with the neural network using gradients from the loss function. (b) A simple model of nonlinear data, in which the function depends on the inputs only through grouped abundances (Methods). (c) We compare our method to three alternative models that lack SCiFI's ability to capture a nonlinear function map, to find function-informed clusters, or both. Each method is described in the main text. (d) $R^2$ of function predictions on a held-out subset of the data for each method. (e) Group recovery for each method as measured by the Jaccard index (Methods).
  • Figure 3: SCiFI correctly learns functional groups in the gut, soil, and ocean microbiome (a) We train our model to predict butyrate production based on the abundances of 30 bacterial strains in synthetic gut communities (data from Ref. Clark2021). Test loss (mean squared error) of predicted butyrate concentrations is shown for varying numbers of functional groups $N_{\text{cluster}}$. Each faint dot is one model trained with one particular test/train split of the data. The solid dot shows the median across the entire ensemble of models ($N=12$ test/train splits total). The dashed gray line shows the variance of butyrate as a reference. (b) Flow chart showing how group structure changes with varying $N_{\text{cluster}}$. The thickness of each bar corresponds to the average abundance of that species across the dataset. Several individual species are highlighted; for interpretation see the main text. (c) SCiFI predictions (with $N_{\text{cluster}}=4$) versus observed butyrate concentrations in each sample. (d) Comparison of test-set $R^2$ values for three models: the neural network, which uses the identified cluster abundances as inputs; a linear regression that uses the cluster abundances as inputs; and a linear regression that uses the projection onto the first 4 principal components as inputs. Each point is a model trained and evaluated using a different test/train split of the data (Methods). All models use a 4-dimensional input (that is, either $N_{\text{cluster}}=4$ or $N_{\text{PCs}}=4$). (e-h) Same plots as in the top row, using succinate as the target function. The models in (g) and (h) use $N_{\text{cluster}}=2$. (i-l) Results for genus abundances in marine communities. The target function is a scalar measurement of environmental nitrate concentration. The models in (k) and (l) use $N_{\text{cluster}}=2$. (m-p) Results for phylum abundances in soil communities. The target function is a time series of nitrate concentrations. The models in (o) and (p) use $N_{\text{cluster}}=3$. All models in this figure are trained without the optional gating step described in Figure 2a. For each dataset, a comparison to several other methods may be found in SI Fig. \ref{si_fig:fig3_method_comparison}.
  • Figure 4: Sparse functional gene groups reveal survival strategies in the ocean microbiome (a) We use SCiFI to cluster gene modules using environmental parameters as a proxy for function. Structure in this dataset is measured via shotgun metagenomics, which quantifies the abundances of different genes across all genomes in the sample. Environmental variables include both those related to community function (e.g. oxygen, nitrate concentrations) and to abiotic forcing (e.g. temperature). (b, left) Average nitrate concentration, oxygen concentration, and temperature as a function of depth. Quantities are normalized to have mean zero and standard deviation of one. (b, right) Average group abundance for the three groups identified with our algorithm, normalized by max/min. (c) Correlation of group abundances with each of the three environmental parameters used as target functions during training: nitrate, oxygen, and temperature. (d) Allocation of several selected KO pathways to our learned groups. Bar heights denote the average abundance of each module assigned to each group. The average is taken across samples and across 12 test/train splits of the dataset; error bars show one standard deviation across test/train splits.
  • Figure 5: Learned functional groups are the relevant variables of minimal dynamical models (a, left) In soil microcosm experiments quantifying nitrate utilization (Ref. lee2024functional), SCiFI finds that only two groups are necessary. (a, right) These group abundances, together with nitrate measurements, are described by a simple consumer-resource model. (b) Evolution of group abundances $x_1$ (blue) and $x_2$ (red) in time. Measurements are taken at $t=0$ and $t=96$h after incubation, with three replicates per sample (each point is one replicate). Colored lines denote inferred dynamics from the consumer-resource model (Eq. \ref{eq:consumer_resource}), where we take the average across replicates. (c) Evolution of nitrate concentration over time. Concentrations are measured at ten time points. Each gray line is the consumer-resource prediction for one replicate.
  • ...and 19 more figures
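
The two-step pipeline described in the Figure 2 caption (aggregate species abundances with a grouping matrix, then predict the function from group abundances with a neural network) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The softmax parametrization of the grouping matrix, the sigmoid gate, the single tanh hidden layer, and all function and parameter names (`grouping_matrix`, `forward`, `logits`, `gate_logits`, `W1`, `b1`, `W2`, `b2`) are assumptions made for illustration; the actual parametrization is given in the paper's Methods. Training is omitted: in practice the grouping parameters and network weights would be learned jointly by gradient descent on the loss using an autodiff library.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def grouping_matrix(logits, gate_logits=None):
    """Soft species-to-group assignment (ASSUMED parametrization).

    logits has shape (n_species, n_groups). Each row is turned into
    assignment weights that sum to 1. If gate_logits (n_species,) is
    given, a sigmoid gate in (0, 1) scales each whole row, so a species
    can be switched off entirely (row driven toward zero).
    """
    G = softmax(logits, axis=1)
    if gate_logits is not None:
        gate = 1.0 / (1.0 + np.exp(-gate_logits))
        G = G * gate[:, None]
    return G

def forward(abundances, logits, W1, b1, W2, b2, gate_logits=None):
    """Step 1: aggregate abundances into groups by matrix multiplication.
    Step 2: predict the function with a small neural network
    (one tanh hidden layer, chosen here only for illustration)."""
    G = grouping_matrix(logits, gate_logits)
    groups = abundances @ G            # (n_samples, n_groups)
    hidden = np.tanh(groups @ W1 + b1)
    prediction = hidden @ W2 + b2      # (n_samples, 1)
    return groups, prediction

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of species."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy usage with random, untrained parameters (hypothetical sizes).
rng = np.random.default_rng(0)
n_samples, n_species, n_groups, n_hidden = 5, 6, 2, 8
X = rng.random((n_samples, n_species))
logits = rng.normal(size=(n_species, n_groups))
W1 = rng.normal(size=(n_groups, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

groups, y_hat = forward(X, logits, W1, b1, W2, b2)

# Hard groups are read off as the argmax of each species' assignment row.
hard = grouping_matrix(logits).argmax(axis=1)
learned_group0 = np.flatnonzero(hard == 0)
```

In this sketch, a learned hard group such as `learned_group0` could be compared with a known group of species via `jaccard`, which is the kind of group-recovery score reported in Figure 2e. Without the gate, each row of the grouping matrix sums to 1, so total abundance per sample is preserved when species are aggregated into groups.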