Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model

Yang Tan, Ruilin Wang, Banghao Wu, Liang Hong, Bingxin Zhou

TL;DR

This study introduces a retrieval-enhanced protein language model that combines native properties from sequence and local structural interactions with evolutionary properties from retrieved homologous sequences to provide reliable predictions of mutation effects.

Abstract

Enzyme engineering enables the modification of wild-type proteins to meet industrial and research demands by enhancing catalytic activity, stability, binding affinities, and other properties. Deep learning methods for protein modeling have demonstrated superior results at lower cost than traditional approaches such as directed evolution and rational design. In mutation effect prediction, the key to pre-training deep learning models lies in accurately interpreting the complex relationships among protein sequence, structure, and function. This study introduces a retrieval-enhanced protein language model for comprehensive analysis of native properties from sequence and local structural interactions, as well as evolutionary properties from retrieved homologous sequences. The state-of-the-art performance of the proposed ProtREM is validated on over 2 million mutants across 217 assays from an open benchmark (ProteinGym). We also conducted post-hoc analyses of the model's ability to improve the stability and binding affinity of a VHH antibody. Additionally, we designed 10 new mutants of a DNA polymerase and conducted wet-lab experiments to evaluate their enhanced activity at higher temperatures. Both in silico and experimental evaluations confirmed that our method provides reliable predictions of mutation effects, offering an auxiliary tool for biologists aiming to evolve existing enzymes. The implementation is publicly available at https://github.com/tyang816/ProtREM.

Paper Structure

This paper contains 34 sections, 8 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: An illustrative workflow of ProtREM for predicting mutation effects. a. For a given template protein, ProtREM encodes structural, sequence, and MSA information to generate logits for each residue, which are used to calculate mutation fitness scores. b. For each AA, its local structure is clustered into $2048$ distinct structure tokens. c. The vector representations of structural and sequence information are integrated using disentangled cross-attention through BERT-style pre-training. d. Homologous information is retrieved via Jackhmmer and converted to a matrix representation of evolutionary logits.
  • Figure 2: A summary of baseline comparisons on the ProteinGym mutation effect prediction task. a. Performance ranks across each assay. For instance, a Rank $1$ (in dark green) for ProtREM with a value of $49$ indicates that ProtREM achieves the highest performance on $49$ out of $217$ assays. b. Performance of ProtREM's ablation models with various homologous sequence search strategies and retrieval ratios, assessed on a $10\%$ randomly split validation set.
  • Figure 3: Performance analysis on low-throughput experimental datasets. (a) Scatter plot of predicted fitness scores (by ProtREM) versus experimentally obtained EC50 values. For both alkali resistance and binding affinity improvements, ProtREM's scores for $31$ VHH antibody mutants carrying 1-4 mutated sites show a clear correlation with experimental data. (b) Performance of different models on the two assays of VHH antibody data. Only ProtREM generated fitness scores that are moderately negatively correlated with EC50 values. (c) 3D structure of the template phi29 DNAP. The AA sites targeted for mutation across the $10$ single-site mutants are highlighted and labeled with their wild-type residues. (d) Activity improvements in phi29 DNAP mutants. Among the $10$ single-site mutants experimentally tested, $8$ show significant activity enhancements, with the top mutant exhibiting an $8$-fold increase. (e) Thermostability of phi29 DNAP mutants. Three mutants demonstrate improvements in both thermostability and activity, with two of them showing significant gains.
  • Figure S1: Spearman score difference between with and without retrieval by pLDDT.
  • Figure S2: Spearman score difference between with and without retrieval by taxon.
  • ...and 5 more figures
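
The Figure 1 caption says that per-residue logits are turned into mutation fitness scores and that retrieved homologs contribute a matrix of evolutionary logits. This excerpt does not give the exact scoring formula, so the following is only a minimal sketch of one common zero-shot recipe, not ProtREM's implementation. It assumes a log-odds score (mutant versus wild-type residue, summed over mutated sites) and a linear mix of the two log-probability matrices. The names `blend_logits`, `score_mutant`, and the mixing weight `alpha` (standing in for a "retrieval ratio") are hypothetical.

```python
import math

# Assumed column order for the 20 standard amino acids (hypothetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(row):
    """Convert one row of raw logits into log-probabilities."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def blend_logits(plm_logits, msa_logits, alpha):
    """Mix model and evolutionary log-probabilities position by position.

    alpha is an assumed mixing weight: 0 uses only the language model,
    1 uses only the homolog-derived logits.
    """
    blended = []
    for plm_row, msa_row in zip(plm_logits, msa_logits):
        p = log_softmax(plm_row)
        q = log_softmax(msa_row)
        blended.append([(1 - alpha) * a + alpha * b for a, b in zip(p, q)])
    return blended


def score_mutant(log_probs, wild_type, mutations):
    """Sum log-odds of mutant vs. wild-type residue over all mutated sites.

    mutations are strings such as "A1G" (wild type, 1-indexed position, mutant).
    """
    score = 0.0
    for mut in mutations:
        wt, pos, mt = mut[0], int(mut[1:-1]) - 1, mut[-1]
        if wild_type[pos] != wt:
            raise ValueError(f"{mut}: wild type at position {pos + 1} is {wild_type[pos]}")
        row = log_probs[pos]
        score += row[AMINO_ACIDS.index(mt)] - row[AMINO_ACIDS.index(wt)]
    return score
```

Under these assumptions, a higher score means the mutant residue is judged more plausible than the wild-type residue at the mutated sites. Scores from such a function could then be rank-correlated with assay measurements, as in the Spearman comparisons listed for Figures S1 and S2.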