Omics-driven hybrid dynamic modeling of bioprocesses with uncertainty estimation

Sebastián Espinel-Ríos, José Montaño López, José L. Avalos

TL;DR

This study presents an omics-driven hybrid dynamic modeling pipeline that combines mechanistic growth dynamics with data-driven components trained on a reduced set of omics features. Random forests with permutation-based feature ranking identify a small set of intracellular proteins that correlate with growth; Gaussian processes then map these features to time-varying model parameters, enabling uncertainty quantification in multiscale predictions for Saccharomyces cerevisiae. The approach is demonstrated on a 350-sample proteomics dataset and seven kinetic experiments in liquid media, where a compact seven-feature vector suffices to reproduce growth trajectories within confidence intervals. The framework offers a scalable path to incorporating more omics layers and larger datasets, with potential applications in smart bioprocessing and model-based design in biotechnology.

Abstract

This work presents an omics-driven modeling pipeline that integrates machine-learning tools to facilitate the dynamic modeling of multiscale biological systems. Random forests and permutation feature importance are proposed to mine omics datasets, guiding feature selection and dimensionality reduction for dynamic modeling. Continuous and differentiable machine-learning functions can be trained to link the reduced omics feature set to key components of the dynamic model, resulting in a hybrid model. As proof of concept, we apply this framework to a high-dimensional proteomics dataset of Saccharomyces cerevisiae. After identifying key intracellular proteins that correlate with cell growth, targeted dynamic experiments are designed, and key model parameters are captured as functions of the selected proteins using Gaussian processes. This approach captures the dynamic behavior of yeast strains under varying proteome profiles while estimating the uncertainty in the hybrid model's predictions. The outlined modeling framework is adaptable to other scenarios, such as integrating additional layers of omics data for more advanced multiscale biological systems, or employing alternative machine-learning methods to handle larger datasets. Overall, this study outlines a strategy for leveraging omics data to inform multiscale dynamic modeling in systems biology and bioprocess engineering.
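
The hybrid-model idea in the abstract (a Gaussian process maps selected protein levels to a model parameter, and the parameter uncertainty is carried into the predicted trajectory) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the use of scikit-learn, the synthetic training data, the logistic growth law with a fixed carrying capacity, the Monte Carlo propagation scheme, and the names `Z_train`, `mu_train`, and `simulate_growth`.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 7 experiments, each described by
# a 7-dimensional vector of selected protein levels and one growth rate (1/h).
n_experiments, n_features = 7, 7
Z_train = rng.uniform(0.0, 1.0, size=(n_experiments, n_features))
mu_train = 0.1 + 0.3 * Z_train[:, 0] + 0.01 * rng.normal(size=n_experiments)

# Gaussian process linking the reduced feature vector to the growth rate.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(Z_train, mu_train)

# Posterior mean and standard deviation of the growth rate for a new profile.
z_new = rng.uniform(0.0, 1.0, size=(1, n_features))
mu_mean, mu_std = gp.predict(z_new, return_std=True)

# Monte Carlo propagation: sample growth rates from the GP posterior.
mu_samples = rng.normal(mu_mean[0], mu_std[0], size=500)


def simulate_growth(mu, x0=0.05, capacity=5.0, t_end=24.0, dt=0.01):
    """Explicit-Euler integration of dX/dt = mu * X * (1 - X / capacity).

    `mu` is a 1-D array; every entry yields one trajectory (one column).
    The growth law and its constants are illustrative assumptions.
    """
    n_steps = int(round(t_end / dt))
    t = np.linspace(0.0, t_end, n_steps + 1)
    x = np.empty((n_steps + 1, mu.size))
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] + dt * mu * x[k] * (1.0 - x[k] / capacity)
    return t, x


t, x = simulate_growth(mu_samples)

# Pointwise mean and 95% band of the predicted biomass trajectory.
x_mean = x.mean(axis=1)
lower, upper = np.percentile(x, [2.5, 97.5], axis=0 + 1)
```

Sampling is only one way to push parameter uncertainty into the time domain; it is used here because it needs no assumptions beyond the Gaussian posterior of the parameter. A stiff or higher-dimensional model would call for a proper ODE solver rather than fixed-step Euler.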

Paper Structure

This paper contains 14 sections, 22 equations, 11 figures.

Figures (11)

  • Figure 1: Scheme of multi-omics data and its relation to different levels of cellular processes that ultimately determine the cell phenotype. Differential equations can be used to model the complex interactions between these cellular processes, including regulatory mechanisms. For simplicity, the genetic information (genome) is assumed to be constant in the cell, hence no dynamic equation is formulated. Refer to Section \ref{sec:materials_methods} for details on notation. The figure contains images from https://www.biorender.com.
  • Figure 2: Pipeline of the proposed omics-driven hybrid modeling framework. The black arrows indicate the general flow of the pipeline. Random forests are used to rank feature importance from raw high-dimensional omics data, resulting in a set of selected features. Experiments can be designed to explore the effects of changes in these selected features on dynamic cell behavior. Gaussian processes are then employed to link key parameters of the dynamic model to changes in the selected features, resulting in a hybrid model. The uncertainty from the Gaussian processes is propagated into the time domain, enabling the estimation of uncertainty in the predicted dynamic trajectories. Refer to Section \ref{sec:materials_methods} for details on notation. The figure contains images from https://www.biorender.com. The image of the Gaussian process was generated using the demo tool available at http://chifeng.scripts.mit.edu/stuff/gp-demo/.
  • Figure 3: Parity plots evaluated on the A) training and B) test subsets for the random forest model with optimized hyperparameters. Number of trees: 100, number of features considered for splitting at each node: $\sqrt{n_v}$, maximum depth: 20, minimum samples for node splitting: 7, minimum samples for leaf nodes: 2, and bootstrapping: false. The prediction uncertainty is represented by the standard deviation of the predictions from the individual decision trees within the random forest.
  • Figure 4: Top 50 important features (intracellular protein concentrations) according to the permutation importance metric computed from the trained random forest with optimal hyperparameters. The cumulative importance is also shown.
  • Figure 5: Effect of increasing the number of important features (cf. Fig. \ref{fig:feature_importance_plot}), up to 20 features, on the $R^2$ value of the random-forest models evaluated using the test set. Each random forest underwent individual grid-search hyperparameter optimization. The intersection (in blue) of the dotted lines denotes the earliest $R^2$ value that is equivalent to that of the optimized random forest using all the features.
  • ...and 6 more figures
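
The captions of Figures 3 to 5 describe three steps: a random forest with the listed hyperparameters, whose prediction uncertainty is the standard deviation across its trees; a permutation-importance ranking with cumulative importance; and a sweep over the number of top-ranked features versus test-set $R^2$. The sketch below strings these together. It assumes scikit-learn (the library actually used is not stated here), runs on synthetic data rather than the proteomics dataset, and reuses the Figure 3 hyperparameters in the sweep instead of repeating the per-model grid search mentioned in the Figure 5 caption. The helper `make_forest` and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the proteomics data (NOT the paper's dataset):
# 350 samples, 30 "proteins"; only the first three drive the growth target.
n_samples, n_proteins = 350, 30
X = rng.normal(size=(n_samples, n_proteins))
y = (3.0 * X[:, 0] + 2.0 * X[:, 1] - 2.0 * X[:, 2]
     + 0.1 * rng.normal(size=n_samples))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def make_forest():
    """Random forest with the hyperparameters listed in the Figure 3 caption."""
    return RandomForestRegressor(
        n_estimators=100,
        max_features="sqrt",
        max_depth=20,
        min_samples_split=7,
        min_samples_leaf=2,
        bootstrap=False,
        random_state=0,
    )


forest = make_forest().fit(X_train, y_train)
r2_all = forest.score(X_test, y_test)

# Prediction uncertainty: spread of the individual trees' predictions.
per_tree = np.stack([tree.predict(X_test) for tree in forest.estimators_])
pred_mean = per_tree.mean(axis=0)
pred_std = per_tree.std(axis=0)

# Permutation importance on held-out data, ranked from most to least important.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
order = np.argsort(perm.importances_mean)[::-1]

# Cumulative importance; negative importances are clipped to zero
# (a choice made for this sketch) so the curve is non-decreasing.
sorted_importance = np.clip(perm.importances_mean[order], 0.0, None)
cumulative = np.cumsum(sorted_importance) / sorted_importance.sum()

# Test-set R^2 when training only on the k top-ranked features.
r2_by_k = []
for k in range(1, 11):
    cols = order[:k]
    model = make_forest().fit(X_train[:, cols], y_train)
    r2_by_k.append(model.score(X_test[:, cols], y_test))

# Smallest k whose R^2 reaches the all-feature model, or None if none does.
first_k = next((k for k, r2 in enumerate(r2_by_k, start=1) if r2 >= r2_all), None)
```

Computing the importance on the held-out subset, rather than the training subset, is a design choice of this sketch; it ranks features by their contribution to generalization rather than to fit.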