CodonMPNN for Organism Specific and Codon Optimal Inverse Folding

Hannes Stark, Umesh Padia, Julia Balla, Cameron Diao, George Church

TL;DR

CodonMPNN addresses organism-specific codon optimization for protein expression by directly generating codon sequences conditioned on a protein backbone structure and the taxonomy of the host organism. It extends the ProteinMPNN framework to predict over the 64 codons instead of amino acids and adds taxon conditioning via a tree-partitioned taxonomy embedding, so the codon choice is host-aware and made jointly with the sequence design rather than in a separate post-processing step. Empirically, CodonMPNN maintains amino acid recovery and designability comparable to ProteinMPNN while achieving higher codon recovery than naive frequency-based baselines, and in a synonymous-mutation analysis the model favors the higher-expression sequence in 72.4% of pairs. Overall, CodonMPNN bridges structure-based design and expression optimization, offering a practical drop-in replacement for inverse folding with organism-specific codon generation.

Abstract

Generating protein sequences conditioned on protein structures is an impactful technique for protein engineering. To synthesize an engineered protein, its amino acid sequence is commonly reverse-translated into DNA and expressed in a host organism such as yeast. One difficulty in this process is that expression rates can be low when the codon sequence is suboptimal for the host organism. We propose CodonMPNN, which generates a codon sequence conditioned on a protein backbone structure and an organism label. If naturally occurring DNA sequences are close to codon optimality, CodonMPNN could learn to generate codon sequences with higher expression yields than heuristic codon choices for generated amino acid sequences. Experiments show that CodonMPNN retains the performance of previous inverse folding approaches and recovers wild-type codons more frequently than baselines. Furthermore, CodonMPNN has a higher likelihood of generating high-fitness codon sequences than low-fitness codon sequences for the same protein sequence. Code is available at https://github.com/HannesStark/CodonMPNN.
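To make the size of the codon design space concrete, the following minimal Python sketch builds the standard genetic code and counts how many distinct codon sequences encode a given protein. It is an illustration only and is not taken from the CodonMPNN codebase; the function names `translate` and `count_synonymous_sequences` are hypothetical, and the standard genetic code (NCBI translation table 1) is assumed.

```python
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), written in DNA letters.
# Codons are enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def translate(dna: str) -> str:
    """Translate a DNA coding sequence (length divisible by 3) to amino acids."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_synonymous_sequences(protein: str) -> int:
    """Number of distinct codon sequences that encode the given protein."""
    return prod(len(AA_TO_CODONS[aa]) for aa in protein)


if __name__ == "__main__":
    print(translate("ATGCTGAGC"))              # the protein encoded by this DNA
    print(count_synonymous_sequences("MLS"))   # 1 * 6 * 6 synonymous encodings
```

Because the count is a product over positions, it grows exponentially with protein length, which is why the codon choice is usually delegated to a heuristic or, as proposed here, to a learned model.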

Paper Structure

This paper contains 7 sections, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: An amino acid sequence can be encoded by DNA sequences in which each triplet of nucleotides (A, C, G, T; U in the transcribed mRNA) corresponds to one amino acid. Since there are 64 possible triplets, called codons, and only 20 amino acids, each protein sequence has multiple possible codon sequences. Some have higher expression rates than others, and some are not expressed at all. The expression level depends on the host in which the codon sequence is expressed.
  • Figure 2: CodonMPNN overview. In the prevailing approach (A), an inverse folding model, such as ProteinMPNN, generates an amino acid sequence. For experimental validation, this is mapped to a codon sequence (DNA sequence) via heuristic codon optimization tools and expressed in a specific system. As an alternative, we propose CodonMPNN (B), which directly generates codon sequences conditioned on a structure and the taxon label of the host organism in which the codon sequence should be expressed.
  • Figure 3: Recovery rates per amino acid type. Codon Recovery and Amino Acid Recovery show the recovery rates of CodonMPNN's generated sequences. Naive Codon Recovery is the recovery rate of codon sequences obtained by translating CodonMPNN's codons to amino acids and choosing their most frequent codons. Oracle Codon Recovery shows the same for the ground truth amino acids.
  • Figure 4: Likelihoods for synonymous coding sequences. Each point is a pair of synonymous coding sequences. Points above the dashed line correspond to correct predictions, i.e. pairs in which the more highly expressed sequence receives the higher log-likelihood. Pair difference in expression yields is the difference between the higher and lower expression yields of the sequences in each pair. Pair difference in log-likelihoods is the log-likelihood of the highly expressed sequence minus that of the lowly expressed sequence.
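
The quantities named in the captions of Figures 3 and 4 can be sketched in a few lines of Python. This is a hedged illustration of the definitions as stated in the captions, not the authors' evaluation code: all function names are hypothetical, the "most frequent codon" table is assumed to be estimated from a reference set of coding sequences, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Standard genetic code in DNA letters, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(dna):
    """Split a coding sequence into its list of codons."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def recovery(pred, true):
    """Fraction of positions at which two equal-length sequences agree."""
    assert len(pred) == len(true)
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def most_frequent_codons(reference_dnas):
    """Map each amino acid to its most frequent codon in a reference set.

    Assumption: the reference set stands in for the host's codon usage.
    Amino acids absent from the reference set get no entry.
    """
    counts = Counter(c for dna in reference_dnas for c in split_codons(dna))
    best = {}
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        if aa not in best or n > counts[best[aa]]:
            best[aa] = codon
    return best


def evaluate(pred_dna, true_dna, best_codon):
    """Compute the four recovery rates named in the Figure 3 caption."""
    pred, true = split_codons(pred_dna), split_codons(true_dna)
    pred_aa = [CODON_TO_AA[c] for c in pred]
    true_aa = [CODON_TO_AA[c] for c in true]
    return {
        "codon_recovery": recovery(pred, true),
        "amino_acid_recovery": recovery(pred_aa, true_aa),
        # Naive: back-translate the predicted amino acids with top codons.
        "naive_codon_recovery": recovery([best_codon[a] for a in pred_aa], true),
        # Oracle: the same back-translation, from ground-truth amino acids.
        "oracle_codon_recovery": recovery([best_codon[a] for a in true_aa], true),
    }


def pairwise_accuracy(pairs):
    """Fraction of synonymous pairs ranked correctly (Figure 4).

    Each pair is (loglik_high_expression, loglik_low_expression); a pair is
    correct when the log-likelihood difference is positive.
    """
    return sum(hi > lo for hi, lo in pairs) / len(pairs)
```

For example, with log-likelihood pairs `[(-1.0, -2.0), (-3.0, -2.5)]`, `pairwise_accuracy` counts one of the two pairs as correct. A value of 0.5 corresponds to chance under this definition, which is the reference point for the 72.4% reported in the TL;DR.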