Deep learning-guided evolutionary optimization for protein design

Erik Hartman, Di Tang, Johan Malmström

TL;DR

BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space, accelerates the discovery of high-confidence binders, demonstrating the potential for efficient protein design across diverse objectives.

Abstract

Designing novel proteins with desired characteristics remains a significant challenge due to the large sequence space and the complexity of sequence-function relationships. Efficient exploration of this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates based on prior evaluations and surrogate model predictions, enabling data-efficient optimization. We demonstrate the utility of BoGA through benchmarking on sequence and structure design tasks, followed by its application in designing peptide binders against pneumolysin, a key virulence factor of Streptococcus pneumoniae. BoGA accelerates the discovery of high-confidence binders, demonstrating the potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available under an MIT license on GitHub at https://github.com/ErikHartman/bopep.
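
The loop described in the abstract (a genetic algorithm proposing candidates, a surrogate model ranking them, and only the top-ranked ones being evaluated) can be illustrated with a minimal sketch. This is not the BoPep implementation: the amino acid composition embedding, the bootstrap ridge-regression ensemble used as surrogate, the upper-confidence-bound acquisition, the toy fitness (fraction of E, M, A, and L residues), and every function and parameter name below are assumptions chosen for illustration.

```python
import random
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fitness(seq):
    """Toy objective (assumption): fraction of E, M, A, and L residues."""
    return sum(seq.count(a) for a in "EMAL") / len(seq)

def mutate(seq, rng):
    """Apply one random substitution, insertion, or deletion."""
    op = rng.choice(["sub", "ins", "del"])
    i = rng.randrange(len(seq))
    if op == "sub":
        return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i + 1:]
    if op == "ins":
        return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i:]
    # Deletion, skipped for very short sequences to keep a minimum length.
    return seq[:i] + seq[i + 1:] if len(seq) > 5 else seq

def embed(seq):
    """Hypothetical embedding: amino acid composition plus a bias term."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return np.append(counts / len(seq), 1.0)

def fit_ensemble(X, y, np_rng, n_models=8, ridge=1e-3):
    """Surrogate (assumption): ridge regressors fit on bootstrap resamples."""
    weights = []
    for _ in range(n_models):
        idx = np_rng.integers(0, len(y), size=len(y))
        Xb, yb = X[idx], y[idx]
        A = Xb.T @ Xb + ridge * np.eye(X.shape[1])
        weights.append(np.linalg.solve(A, Xb.T @ yb))
    return np.array(weights)  # shape: (n_models, n_features)

def ucb(weights, X, beta=1.0):
    """Acquisition (assumption): ensemble mean plus beta times ensemble std."""
    preds = X @ weights.T  # shape: (n_candidates, n_models)
    return preds.mean(axis=1) + beta * preds.std(axis=1)

def boga_sketch(n_init=20, n_generations=15, n_elite=10,
                k_propose=100, m_select=5, seed=0):
    rng = random.Random(seed)
    np_rng = np.random.default_rng(seed)
    # Initial evaluated dataset D_0 of random 12-residue sequences.
    data = {}
    while len(data) < n_init:
        s = "".join(rng.choice(AMINO_ACIDS) for _ in range(12))
        data[s] = fitness(s)
    best_per_gen = [max(data.values())]
    for _ in range(n_generations):
        # Elite set: top of the leaderboard ranked by fitness.
        elite = sorted(data, key=data.get, reverse=True)[:n_elite]
        # Proposal pool of up to k_propose unevaluated mutants.
        mutants = [mutate(rng.choice(elite), rng) for _ in range(k_propose)]
        pool = [s for s in dict.fromkeys(mutants) if s not in data]
        if pool:
            X = np.array([embed(s) for s in data])
            y = np.array(list(data.values()))
            weights = fit_ensemble(X, y, np_rng)
            scores = ucb(weights, np.array([embed(s) for s in pool]))
            # Only the m_select highest-scoring proposals are evaluated.
            for i in np.argsort(scores)[::-1][:m_select]:
                data[pool[i]] = fitness(pool[i])
        best_per_gen.append(max(data.values()))
    return data, best_per_gen

if __name__ == "__main__":
    evaluated, trajectory = boga_sketch()
    print(len(evaluated), trajectory[0], trajectory[-1])
```

In this sketch the evaluation budget per generation is fixed by `m_select`, so increasing `k_propose` widens the pool the surrogate screens without increasing the number of fitness evaluations.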

Paper Structure (16 sections, 13 equations, 3 figures, 1 algorithm)

Figures (3)

  • Figure 1: BoGA couples evolutionary optimization with Bayesian selection for sequence design. a Schematic of the optimization cycle. From an initial evaluated dataset $\mathcal{D}_0$, an elite set $S_k$ is chosen from a leaderboard where each sequence is ranked by its fitness $f(\mathbf{x})$. A proposer (genetic mutations: substitutions/insertions/deletions) generates a pool of candidates $\mathcal{X}'$ of size $k_{\text{propose}}$. Sequences are embedded and scored by a surrogate model $\hat{f}_\theta$; an acquisition function $\alpha$ selects the $m_{\text{select}}$ top candidates for explicit evaluation. The fitness function is determined by the user, and can utilize properties derived from the amino acid sequence, the folded structure, and/or interchain interactions defined by a folded complex. The newly scored sequences are added to the dataset $\mathcal{D}_t$ and the loop is repeated. b Conceptual search in sequence space. A standard GA (orange) evaluates many proposals, including those that decrease fitness, whereas BoGA (blue) uses the surrogate to discard low-value proposals (gray) and concentrates evaluations on proposals that are more likely to increase fitness. c Example of optimization trajectories when maximizing molecular weight, showing improved performance as $k_{\text{propose}}$ increases.
  • Figure 2: BoGA improves optimization efficiency for sequence- and structure-level objectives. a Optimization of $\beta$-sheet fraction (fraction of E, M, A, and L residues). The left panel shows optimization trajectories for different values of $k_{\text{propose}}$, and the right panel shows the distribution of fitness values at various stages of optimization. The inset shows the surrogate model $R^2$ over the course of optimization. b Same layout as in a, but with the normalized hydrophobic moment (uHrel) as the objective function. c Structure-guided optimization using AlphaFold 2-predicted secondary structure weighted by predicted TM-score (pTM). The left panel shows optimization trajectories, and violin plots show the distribution of fitness values for the last 10 generations. The right panel shows a triplot of the secondary structures of samples for the run with $k_{\text{propose}} = 500$.
  • Figure 3: BoGA enables efficient design of peptide binders targeting pneumolysin. a Illustration of the neutralizing monoclonal antibody and PLY complex. PLY is shown in gray and the antibody in green. The model at the bottom represents domain 4. b Optimization trajectories for the binder score across 100 generations for $k_{\mathrm{propose}} = 10$ (teal) and $k_{\mathrm{propose}} = 500$ (purple). The points indicate individual evaluated candidates; solid lines are running top-quartile means. Right: Kernel density estimates of the binding score distribution during early (0-10), intermediate (10-50), and late (50-100) generations, showing that larger proposal pools accelerate discovery of high-scoring binders. c Relationship between predicted interface pTM (ipTM) and peptide predicted aligned error (PAE) for evaluated sequences. Larger $k_{\mathrm{propose}}$ increases sampling of high-confidence, low-PAE binders. The dashed lines show cutoffs for ipTM = 0.9 and PAE = 5 Å. d Outline of the post-optimization refinement approach. The top 100 BoGA sequences were subjected to three rounds of sequence recovery using ProteinMPNN and FastRelax, resulting in a total of 400 candidates, which were subsequently re-docked to the full-length PLY and scored. e pLDDT versus ipTM for refined candidates. Points are colored by Boltz-2 interface distance score. High-confidence binders cluster at high ipTM and high pLDDT (upper right quadrant, dashed lines). f Example predicted complexes (Boltz-2) of top-performing binders (cyan) bound to PLY (gray). Filtering of the top candidates using quality, solubility and structural consistency yielded a final set of 41 high-confidence binders. g Orthogonal structure prediction using AlphaFold 3 and Boltz-2 for one of the top binders. Both models support a consistent binding pose with high interface confidence (ipTM = 0.85 for AlphaFold 3; ipTM = 0.92 for Boltz-2). Estimated binding free energy ($\Delta G$) from PyRosetta indicates a strongly favorable interaction.
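
The ipTM and PAE cutoffs drawn in Figure 3c suggest a simple way to flag high-confidence candidates. The sketch below applies those two thresholds (ipTM = 0.9, PAE = 5 Å) to a list of scored candidates. The `Candidate` record, the function name, the inclusive comparisons, and the example sequences and scores are all hypothetical and are not taken from the paper or from BoPep.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sequence: str
    iptm: float  # predicted interface pTM, between 0 and 1
    pae: float   # peptide predicted aligned error, in Å

def high_confidence(candidates, iptm_cutoff=0.9, pae_cutoff=5.0):
    """Keep candidates with ipTM at or above and PAE at or below the cutoffs.

    Treating both cutoffs as inclusive is an assumption of this sketch.
    """
    return [c for c in candidates
            if c.iptm >= iptm_cutoff and c.pae <= pae_cutoff]

# Made-up candidates for illustration only.
examples = [
    Candidate("ACDEFGHIKL", iptm=0.93, pae=3.8),
    Candidate("LMNPQRSTVW", iptm=0.95, pae=7.2),  # fails the PAE cutoff
    Candidate("EEALKKLAEM", iptm=0.71, pae=4.1),  # fails the ipTM cutoff
]
selected = high_confidence(examples)
```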