An Integrative Genome-Scale Metabolic Modeling and Machine Learning Framework for Predicting and Optimizing Biofuel-Relevant Biomass Production in Saccharomyces cerevisiae

Neha K. Nair, Aaron D'Souza

Abstract

Saccharomyces cerevisiae is a cornerstone organism in industrial biotechnology, valued for its genetic tractability and robust fermentative capacity. Accurately predicting biomass flux across diverse environmental and genetic perturbations remains a significant challenge for rational strain design. We present a computational framework combining the Yeast9 genome-scale metabolic model with machine learning and optimization to predict, interpret, and enhance biomass flux. Flux balance analysis generated 2,000 flux profiles by varying glucose, oxygen, and ammonium uptake rates. Random Forest and XGBoost regressors achieved $R^2$ of 0.99989 and 0.9990, respectively. A variational autoencoder revealed four distinct metabolic clusters, and SHAP analysis identified glycolysis, the TCA cycle, and lipid biosynthesis as key biomass determinants. In silico overexpression achieved a biomass flux of 0.979 gDW/hr, while Bayesian optimization of nutrient constraints produced a 12-fold increase (0.0858 to 1.041 gDW/hr). A generative adversarial network proposed stoichiometrically feasible novel flux configurations. This framework demonstrates how genome-scale simulation, interpretable ML, and generative modeling can advance yeast metabolic engineering.
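
The sampling step described in the abstract (2,000 conditions obtained by varying glucose, oxygen, and ammonium uptake rates) could be set up roughly as follows. This is a minimal sketch, not the authors' procedure: the uniform sampling scheme, the bounds in `UPTAKE_BOUNDS`, and the helper name `sample_conditions` are all assumptions made for illustration, since the source does not state the ranges or the sampling design.

```python
import random

# Hypothetical uptake-rate bounds (mmol/gDW/hr). The source does not give
# the actual ranges, so these values are placeholders for illustration only.
UPTAKE_BOUNDS = {
    "glucose": (0.5, 20.0),
    "oxygen": (0.0, 20.0),
    "ammonium": (0.5, 10.0),
}


def sample_conditions(n, bounds, seed=0):
    """Draw n nutrient conditions uniformly at random within the given bounds.

    Uniform sampling is an assumption; the original study may have used a
    grid, Latin hypercube, or another design.
    """
    rng = random.Random(seed)  # fixed seed so the design is reproducible
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        for _ in range(n)
    ]


# One dictionary per simulated condition, 2,000 in total as in the abstract.
conditions = sample_conditions(2000, UPTAKE_BOUNDS)
print(len(conditions), conditions[0])
```

Each sampled condition would then be imposed as uptake constraints on the genome-scale model before solving the flux balance problem and recording the resulting biomass flux; that step is omitted here because it requires the Yeast9 model file.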

Paper Structure

This paper contains 27 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Two-dimensional latent space learned by the VAE. Each point represents one of the 2,000 FBA-simulated flux profiles projected onto the two principal latent dimensions. Most profiles are concentrated in Latent Dim 1 $\in [-5, 5]$, Latent Dim 2 $\in [-4, 4]$.
  • Figure 2: Cluster number selection diagnostics. Left: Elbow method (inertia vs. $k$), decreasing from $\approx4350$ at $k{=}2$ to $\approx1500$ at $k{=}9$. Right: Silhouette score vs. $k$, with peak at $k{=}2$ ($\approx0.341$) and secondary peak at $k{=}6$ ($\approx0.326$), supporting the selection of $k{=}4$.
  • Figure 3: K-means clustering of flux profiles in latent space ($k{=}4$). Cluster 1 (highest mean biomass flux, 0.5543 gDW$\cdot$hr$^{-1}$) is associated with the rightward region; remaining clusters occupy overlapping central and left regions.
  • Figure 4: Scatter plot of Random Forest-predicted versus true biomass flux values on the held-out test set. Points closely follow the identity line across 0.15--1.05 gDW$\cdot$hr$^{-1}$ (Test $R^2$ = 0.99989).
  • Figure 5: Scatter plot of FFNN-predicted versus true biomass flux values on the held-out test set. Predicted values exhibit higher scatter relative to the Random Forest model, indicating that further hyperparameter optimisation is required.
  • ...and 7 more figures
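
The two diagnostics named in the Figure 2 caption, inertia (the elbow method) and the silhouette score, can be computed for any fixed cluster assignment as sketched below. This is an illustrative standard-library implementation on four made-up 2-D points, not the authors' code or data; the function names `inertia` and `silhouette_score` are chosen here for the example.

```python
import math
from collections import defaultdict


def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    total = 0.0
    for members in groups.values():
        # Centroid is the coordinate-wise mean of the cluster members.
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


def silhouette_score(points, labels):
    """Mean silhouette s = (b - a) / max(a, b) over all points.

    a: mean distance to the other members of the point's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Points in singleton clusters are assigned s = 0 by convention.
    Requires at least two clusters.
    """
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in groups.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated pairs (illustrative values only).
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
print(inertia(toy_points, toy_labels), silhouette_score(toy_points, toy_labels))
```

In a cluster-number scan, both quantities would be recomputed for each candidate $k$ after clustering the latent coordinates, and the resulting curves compared as in Figure 2.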