Multi-Scale Representation Learning for Protein Fitness Prediction

Zuobai Zhang, Pascal Notin, Yining Huang, Aurélie Lozano, Vijil Chenthamarakshan, Debora Marks, Payel Das, Jian Tang

TL;DR

The paper tackles zero-shot fitness prediction of protein mutations under limited labeled data. It presents S3F, a multi-scale framework that fuses sequence language-model embeddings, a GVP-based structure encoder, and a surface-point encoder to predict mutational effects, pre-trained via residue-type masking on the CATH dataset. On ProteinGym, S3F (especially with MSA augmentation as S3F-MSA) achieves state-of-the-art Spearman performance while remaining parameter-efficient, and ablation analyses show consistent gains from adding structure and surface information. This work demonstrates that integrating sequence, structure, and surface representations improves predictive power, aids in capturing epistasis, and generalizes to unseen protein families, with practical implications for protein design and engineering.

Abstract

Designing novel functional proteins crucially depends on accurately modeling their fitness landscape. Given the limited availability of functional annotations from wet-lab experiments, previous methods have primarily relied on self-supervised models trained on vast, unlabeled protein sequence or structure datasets. While initial protein representation learning studies focused solely on either sequence or structural features, recent hybrid architectures have sought to merge these modalities to harness their respective strengths. However, these sequence-structure models have so far achieved only incremental improvements over the leading sequence-only approaches, highlighting unresolved challenges in effectively leveraging these modalities together. Moreover, the function of certain proteins is highly dependent on the granular aspects of their surface topology, which have been overlooked by prior models. To address these limitations, we introduce the Sequence-Structure-Surface Fitness (S3F) model, a novel multimodal representation learning framework that integrates protein features across several scales. Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology. The proposed method achieves state-of-the-art fitness prediction on the ProteinGym benchmark encompassing 217 substitution deep mutational scanning assays, and provides insights into the determinants of protein function. Our code is at https://github.com/DeepGraphLearning/S3F.

Paper Structure

This paper contains 21 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Multi-scale Pre-training and Inference Frameworks for Protein Fitness Prediction. During pre-training, protein sequences and structures are sampled from a database, with 15% of residue types randomly masked. These sequences are fed into a protein language model, ESM-2-650M. Then, the output residue representations are used to initialize node features in our structure and surface encoders. Through message passing on structure and surface graphs, our methods, S2F (blue) and S3F (green), accurately predict the residue type distribution at each masked position. This distribution is subsequently used for mutation preferences in downstream fitness prediction tasks.
  • Figure 2: Results of ESM-2-650M, S2F, S3F, and S3F-MSA for Analyzing Contributions of Sequences, Structures, Surfaces, and Alignments. (a-d) Performance breakdown (Spearman's rank correlation) on assays grouped by function type (a), MSA depth (b), taxon (c), and mutation depth (d). (e-f) Impact of protein structure quality on performance. (e) Performance breakdown on assays with low-, medium-, and high-quality structures. (f) Results using five groups of AlphaFold2-predicted structures ranked by pLDDT (0 for the highest pLDDT, 4 for the lowest pLDDT). (g) Results on all assays and on out-of-distribution assays with low sequence similarity to the pre-training dataset.
  • Figure 3: Case Study on GB1. (a-c) For each pair of mutation sites, we plot the Spearman's rank correlation between the experimental values and model-predicted scores for all 361 mutations: ESM (a), S2F (b), and S3F (c). The epistasis between residues 234-252 and residues 266-282 (in the black rectangle) is better captured by S2F and S3F. (d) Visualization of the predicted structure for GB1. Mutation regions 234-252 and 266-282 are highlighted in red and blue, respectively.
  • Figure 4: Spearman's rank correlation for ESM-2-650M, S2F, S3F, and S3F-MSA on 19 proteins with less than 30% sequence similarity to the pre-training dataset.
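
The pipeline summarized in Figure 1 (mask 15% of residue types, predict a per-position residue-type distribution, reuse it to score mutations) and the Spearman evaluation used in Figures 2-4 can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule below is the common log-odds heuristic (mutant log-probability minus wild-type log-probability, summed over mutated sites), which this listing does not spell out, and all function names are hypothetical. The per-position log-probabilities are assumed to come from some trained model; the sketch only shows what is done with them.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def mask_positions(length, mask_rate=0.15, seed=None):
    """Choose residue positions to mask during pre-training (15% per Figure 1)."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * length))
    return sorted(rng.sample(range(length), n_mask))


def score_mutant(log_probs, wild_type, mutations):
    """Assumed log-odds score: sum of log p(mutant) - log p(wild type).

    log_probs: list of length-20 lists, one row of log-probabilities per position.
    mutations: list of (position, mutant_residue) pairs, 0-indexed.
    """
    score = 0.0
    for pos, mut in mutations:
        row = log_probs[pos]
        score += row[AA_INDEX[mut]] - row[AA_INDEX[wild_type[pos]]]
    return score


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)


# Toy usage: a uniform distribution stands in for real model output.
wild_type = "ACDEF"
uniform_row = [math.log(1 / 20)] * 20
log_probs = [list(uniform_row) for _ in wild_type]
single_mutant_score = score_mutant(log_probs, wild_type, [(1, "W")])
double_mutant_score = score_mutant(log_probs, wild_type, [(1, "W"), (3, "K")])
```

In a benchmark setting, `spearman` would be applied per assay to the model scores and the experimental measurements, then averaged across assays. Summing independent per-site terms is itself a simplification: whether a given model captures epistasis depends on how the predicted distributions are conditioned on the rest of the protein.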