Pre-trained protein language model for codon optimization

Shashank Pathak; Guohui Lin

TL;DR

The ORFs generated by the proposed models outperformed their natural counterparts encoding the same proteins on computational metrics for stability and expression and demonstrated enhanced performance against the benchmark ORFs used in mRNA vaccines for the SARS-CoV-2 viral spike protein and the varicella-zoster virus.

Abstract

Motivation: Codon optimization of open reading frame (ORF) sequences is essential for enhancing mRNA stability and expression in applications such as mRNA vaccines, where codon choice can significantly affect protein yield, which in turn directly affects the strength of the immune response. In this work, we investigate the use of a pre-trained protein language model (PPLM) to obtain rich representations of amino acids that can be used for codon optimization. This reduces ORF optimization to a simpler fine-tuning task on top of the PPLM. Results: The ORFs generated by our proposed models outperformed their natural counterparts encoding the same proteins on computational metrics for stability and expression. They also demonstrated enhanced performance against the benchmark ORFs used in mRNA vaccines for the SARS-CoV-2 viral spike protein and the varicella-zoster virus (VZV). These results highlight the potential of adapting a PPLM to design ORFs tailored to encode target antigens in mRNA vaccines.

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

Figure 1: Codon Mask for each $t^{th}$ amino acid ($CV_{t^{th} amino\_acid}$)
Figure 2: Comparison of ORFs generated for different species (human, E. coli, and CHO) by their respective fine-tuned models: Adasel-ProtBert-short, Adasel-ProtBert-E.coli, and Adasel-ProtBert-CHO. (a) Expression and stability evaluation of Adasel-ProtBert-short on ORF optimization across species, shown with the respective wild-type ORFs. (b) Expression and stability evaluation of Adasel-ProtBert-E.coli on ORF optimization across species, shown with the respective wild-type ORFs. (c) Expression and stability evaluation of Adasel-ProtBert-CHO on ORF optimization across species, shown with the respective wild-type ORFs.
Figure 3: ORFs from Adasel-ProtBert-long-mfe, Adasel-ProtBert-random-mfe, Adasel-ProtBert-short, Adaptive-ProtBert-short, and Bi-LSTM come from models trained on Hg19 genes, whereas Adasel-ProtBert-Ecoli and Adasel-ProtBert-CHO are the models trained on E. coli and CHO genes, respectively. Covid19-Wild-Type is the naturally occurring ORF sequence found in the SARS-CoV-2 virus. (a) The ORF sequences from each design type with their CAI (the higher the better) and GC content (optimal range is 30-70%). (b) Structural stability of the different design types in terms of MFE (the lower the MFE value, the higher the stability). (c) Comparison of expression level with stability for the ORFs.
Figure 4: The design types Adasel-ProtBert, Adaptive-ProtBert, and Bi-LSTM are models trained on Hg19 genes, whereas Adasel-ProtBert-Ecoli and Adasel-ProtBert-CHO are trained on E. coli and Chinese hamster genes, respectively. Wild-Type is the naturally occurring ORF sequence found in VZV. (a) The ORF sequences from each design type with their CAI (the higher the better) and GC content (optimal range is 30-70%). (b) Structural stability of the different design types in terms of MFE (the lower the MFE value, the higher the stability). (c) Comparison of expression level with stability for the ORFs.
Figure 5: Flow chart of codon optimization as a sequence tagging task. First, the input protein sequence is split into individual amino acid tokens. Each amino acid token is passed to a neural network (Encoder) to capture rich, context-aware representations. A classifier layer (FNN) and the Codon Mask are then applied in a time-distributed way to tag the optimal codon, out of 61, for each amino acid. The final output is a sequence of codons, i.e., the optimized ORF.
...and 1 more figure
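
The codon-mask step described in the Figure 1 and Figure 5 captions can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SENSE_CODONS`, `codon_mask`, and `tag_codons` are hypothetical, the ordering of the 61 output classes is an assumption, codons are written in the DNA alphabet, and `scores` stands in for whatever per-position output the classifier layer produces. The sketch only shows how a mask can restrict each position to codons that are synonymous for the given amino acid before the best-scoring one is picked.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# The 61 sense codons (stop codons dropped). Using this list order as the
# classifier's output order is an assumption made for illustration.
SENSE_CODONS = [codon for codon, aa in CODON_TO_AA.items() if aa != "*"]


def codon_mask(amino_acid):
    """Return a 0/1 vector of length 61 marking codons that encode amino_acid."""
    return [1 if CODON_TO_AA[codon] == amino_acid else 0 for codon in SENSE_CODONS]


def tag_codons(protein, scores):
    """Pick, per position, the highest-scoring codon allowed by the mask.

    protein: string of one-letter amino acid codes (the 20 standard ones).
    scores:  one list of 61 numbers per amino acid (hypothetical classifier output).
    A non-standard letter has an all-zero mask, so max() raises ValueError.
    """
    orf = []
    for amino_acid, row in zip(protein, scores):
        mask = codon_mask(amino_acid)
        allowed = (i for i, keep in enumerate(mask) if keep)
        best = max(allowed, key=lambda i: row[i])
        orf.append(SENSE_CODONS[best])
    return "".join(orf)
```

Because the mask removes every non-synonymous codon before the selection, the output always translates back to the input protein regardless of the scores; the scores only decide which synonymous codon is used.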