Exploring Large Protein Language Models in Constrained Evaluation Scenarios within the FLIP Benchmark

Manuel F. Mollon, Joaquin Gonzalez-Rodriguez, Alicia Lozano-Diez, Daniel Ramos, Doroteo T. Toledano

TL;DR

This work evaluates large protein language modeling approaches under FLIP's constrained data regime, addressing protein fitness prediction with limited task-specific data. It extends FLIP with ESM-2 variants and SaProt, employing an embedding-based training head and evaluating via Mean Squared Error (MSE) and Spearman rank correlation $\rho$ across AAV, Meltome, and GB1 splits that emphasize low-mutation training and high-mutation testing. Key findings show that deeper configurations (e.g., 33- or 48-layer ESM-2) often improve generalization, while SaProt can exhibit strong training performance but risks overfitting; structure-aware inputs generally enhance robustness, albeit with higher computational costs. The results inform practical model selection for data-scarce protein engineering tasks and highlight directions for mitigating overfitting and understanding generalization across mutation landscapes.
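The two evaluation metrics named above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names (`mse`, `average_ranks`, `spearman_rho`) and the sample values are hypothetical, and Spearman's $\rho$ is computed here in the standard way, as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between measured and predicted fitness values."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(y_true, y_pred):
    """Spearman rho: Pearson correlation computed on the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = mean(rt), mean(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / (var_t * var_p) ** 0.5


# Hypothetical fitness values, for illustration only (not data from the paper).
measured = [0.1, 0.4, 0.35, 0.8]
predicted = [0.2, 0.3, 0.5, 0.7]
print(mse(measured, predicted), spearman_rho(measured, predicted))
```

MSE penalizes errors in the predicted values themselves, whereas $\rho$ depends only on how well the predictions order the variants, so a model can score well on one metric and poorly on the other.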

Abstract

In this study, we expand upon the FLIP benchmark, which was designed for evaluating protein fitness prediction models in small, specialized prediction tasks, by assessing the performance of state-of-the-art large protein language models, including ESM-2 and SaProt, on the FLIP dataset. Unlike larger, more diverse benchmarks such as ProteinGym, which cover a broad spectrum of tasks, FLIP focuses on constrained settings where data availability is limited. This makes it a well-suited framework for evaluating model performance in scenarios with scarce task-specific data. We investigate whether recent advances in protein language models lead to significant improvements in such settings. Our findings provide insights into the performance of large-scale models in specialized protein prediction tasks.

Paper Structure

This paper contains 22 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Model that processes embeddings
  • Figure 2: Pipeline implementation for SaProt and ESM
  • Figure 3: Model comparison by split for both MSE and $\rho$ metrics (GB1)
  • Figure 4: Violin plots for metrics across splits and models for the Meltome dataset.
  • Figure 5: Violin plots for metrics across splits and models for the GB1 dataset.
  • ...and 1 more figure