Integrating Deep Learning and Synthetic Biology: A Co-Design Approach for Enhancing Gene Expression via N-terminal Coding Sequences

Zhanglu Yan, Weiran Chu, Yuhua Sheng, Kaiwen Tang, Shida Wang, Yanfeng Liu, Weng-Fai Wong

TL;DR

This study tackles the optimization of N-terminal coding sequences (NCS) to maximize translation initiation and gene expression. It introduces a deep learning/synthetic biology co-design workflow that encodes each NCS using $k$-nearest encoding with $k=3$ and Word2Vec CBOW embeddings, followed by attention-LSTM embeddings and a time-series predictor that forecasts expression and drives a direct-search optimization under limited data. Across six iterative experiments, the approach yields the NCS MLD$_{62}$, which gives a $5.41$-fold increase in GFP expression in Bacillus subtilis, surpassing endogenous and published NCS designs. The method is also shown to improve NeuAc production by regulating GNA1, achieving a $1.25$-fold increase over the wild type (WT). The GFP NCS expression dataset and protocols are open-sourced for public use. This work offers a data-efficient, transferable framework for NCS design with practical implications for metabolic engineering.

Abstract

The N-terminal coding sequence (NCS) influences gene expression by impacting the translation initiation rate. The NCS optimization problem is to find an NCS that maximizes gene expression, a problem of practical importance in genetic engineering. However, current methods for NCS optimization, such as rational design and statistics-guided approaches, are labor-intensive and yield only relatively small improvements. This paper introduces a deep learning/synthetic biology co-designed few-shot training workflow for NCS optimization. Our method uses k-nearest encoding followed by word2vec to encode the NCS, then performs feature extraction using attention mechanisms, and then constructs a time-series network for predicting gene expression intensity. Finally, a direct search algorithm identifies the optimal NCS with limited training data. We took green fluorescent protein (GFP) expressed by Bacillus subtilis as the reporter protein for NCSs, and employed the fluorescence enhancement factor as the metric of NCS optimization. Within just six iterative experiments, our model generated an NCS (MLD62) that increased average GFP expression by 5.41-fold, outperforming the state-of-the-art NCS designs. Extending our findings beyond GFP, we showed that our engineered NCS (MLD62) can effectively boost the production of N-acetylneuraminic acid by enhancing the expression of the crucial rate-limiting GNA1 gene, demonstrating its practical utility. We have open-sourced our NCS expression database and experimental procedures for public use.
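The encoding, metric, and search steps named in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the text here does not confirm: that k-nearest encoding with k = 3 amounts to splitting the sequence into overlapping windows of three adjacent nucleotides, that the fluorescence enhancement factor is the ratio of a variant's signal to a reference signal, and that direct search can be approximated by greedily accepting improving single-base substitutions. The function names (`kmer_tokens`, `enhancement_factor`, `direct_search`) and the toy scoring function are hypothetical and stand in for the trained predictor; this is not the authors' implementation.

```python
from typing import Callable, List

NUCLEOTIDES = "ACGT"


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into overlapping windows of length k.

    Assumption: this is one plausible reading of "k-nearest encoding";
    the resulting tokens could then be fed to a Word2Vec CBOW model.
    """
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def enhancement_factor(variant_signal: float, reference_signal: float) -> float:
    """Ratio of variant fluorescence to reference fluorescence (assumed definition)."""
    if reference_signal <= 0:
        raise ValueError("reference signal must be positive")
    return variant_signal / reference_signal


def direct_search(start: str, score: Callable[[str], float], max_rounds: int = 10) -> str:
    """Greedy search over single-base substitutions.

    Any substitution that raises the score is accepted immediately;
    the search stops when a full pass finds no improvement.
    """
    best = start.upper()
    best_score = score(best)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(best)):
            for base in NUCLEOTIDES:
                if base == best[i]:
                    continue
                candidate = best[:i] + base + best[i + 1:]
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    best, best_score, improved = candidate, candidate_score, True
        if not improved:
            break
    return best


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for the expression predictor: counts adenines."""
    return float(seq.count("A"))


if __name__ == "__main__":
    print(kmer_tokens("ATGAAA"))
    print(enhancement_factor(5.41, 1.0))
    print(direct_search("GCTTCC", toy_score))
```

In a real workflow the toy scorer would be replaced by the trained expression predictor, and each search round would be followed by wet-lab measurement of the proposed sequences before retraining.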

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Gene adjusting methods.
  • Figure 2: Workflow of our training methods.
  • Figure 3: NCS expression intensity analysis.
  • Figure 4: Efficient Synthesis of N-Acetylneuraminic Acid (NeuAc) using MLD-NCSs. (a) Synthetic pathway of NeuAc in Bacillus subtilis. The pathway involves several key genes (highlighted in red) and metabolic products including F6P (fructose-6-phosphate), GlcN6P (glucosamine-6-phosphate), GlcNAc6P (N-acetylglucosamine-6-phosphate), GlcNAc (N-acetylglucosamine), ManNAc (N-acetylmannosamine), and NeuAc (N-acetylneuraminic acid). Key genes in the pathway are glmS (glutamine-fructose-6-phosphate aminotransferase), GNA1 (glucosamine-6-phosphate N-acetyltransferase), yqaB (N-acetylglucosamine-6-phosphate phosphatase), age (N-acetylglucosamine-2-epimerase), and neuB (N-acetylneuraminic acid synthase), with ptsG (phosphotransferase) also involved. GNA1, the rate-limiting enzyme, was regulated by NCS. (b) NeuAc synthesis strategy in Bacillus subtilis. The strategy combines genomic integration of the glmS and yqaB genes with deletion of ptsG and plasmid-based expression of the GNA1, age, and neuB genes. GNA1, as the critical rate-limiting step, is targeted for regulation by NCS at its N-terminus. (c) Fermentation results. The optimized MLD$_{62}$ variant showed a 25.3% increase in NeuAc production compared to the wild type, and a 9.6% increase over the most potent natural NCS variant.
  • Figure 5: NCS expression intensity for two additional rounds.
  • ...and 1 more figures