Exploring zero-shot structure-based protein fitness prediction

Arnav Sharma, Anthony Gitter

TL;DR

Predicted structures for disordered regions can be misleading and affect predictive performance. Evaluating an additional structure-based model on the ProteinGym substitution benchmark shows that simple multi-modal ensembles are strong baselines.

Abstract

The ability to make zero-shot predictions about the fitness consequences of protein sequence changes with pre-trained machine learning models enables many practical applications. Such models can be applied for downstream tasks like genetic variant interpretation and protein engineering without additional labeled data. The advent of capable protein structure prediction tools has led to the availability of orders of magnitude more precomputed predicted structures, giving rise to powerful structure-based fitness prediction models. Through our experiments, we assess several modeling choices for structure-based models and their effects on downstream fitness prediction. Zero-shot fitness prediction models can struggle to assess the fitness landscape within disordered regions of proteins, those that lack a fixed 3D structure. We confirm the importance of matching protein structures to fitness assays and find that predicted structures for disordered regions can be misleading and affect predictive performance. Lastly, we evaluate an additional structure-based model on the ProteinGym substitution benchmark and show that simple multi-modal ensembles are strong baselines.

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Difference in Spearman correlation between predictions made using predicted and experimental structures. Each datapoint represents a DMS assay for which an experimental structure exists.
  • Figure 2: Protein-level Spearman $\rho$ for function prediction on proteins whose ProteinGym assay target sequence contains disordered regions according to the DisProt database.
  • Figure 2: Spearman correlation averaged across UniProtID and then function type for disordered and ordered regions. These scores were computed across proteins that contain mutations in both disordered and ordered regions.
  • Figure 3: Predictions made by ESM2 650M on the P53_HUMAN_Giacomelli_2018_WT_Nutlin dataset separated by disordered and ordered regions.
  • Figure 4: Protein structure alignment between the experimental (orange, PDB 1XQ8 ulmer_structure_2005) and predicted (blue) structures for P37840 (human α-synuclein). The ProteinGym AlphaFold 2 predicted structure resembles an α-synuclein fibril, PDB 2N0A tuttle_solid-state_2016, which is a different conformation from the predicted α-synuclein structure in the AlphaFold Protein Structure Database (AF-P37840-F1-v4) varadi_alphafold_2024.
  • ...and 3 more figures
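
One caption above describes Spearman correlation "averaged across UniProtID and then function type". The following is a minimal sketch of that two-level averaging, not the authors' code. The record fields (`uniprot_id`, `function_type`, `predicted`, `measured`) and the function names are hypothetical, and the final step of averaging the function types into one overall score is an assumption beyond what the caption states.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def aggregate(assays):
    """Average per-assay Spearman by protein, then by function type.

    `assays` is a list of dicts with the hypothetical keys described above.
    Returns (scores per function type, mean over function types).
    """
    per_protein = defaultdict(list)
    for assay in assays:
        rho = spearman(assay["predicted"], assay["measured"])
        per_protein[(assay["function_type"], assay["uniprot_id"])].append(rho)

    per_function = defaultdict(list)
    for (function_type, _uniprot_id), rhos in per_protein.items():
        per_function[function_type].append(mean(rhos))

    by_function = {f: mean(scores) for f, scores in per_function.items()}
    return by_function, mean(list(by_function.values()))
```

Averaging within a protein first keeps a protein with many assays from dominating its function type, and averaging function types last keeps a heavily represented function type from dominating the overall score.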