Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training

Felix Teufel, Aaron W. Kollasch, Yining Huang, Ole Winther, Kevin K. Yang, Pascal Notin, Debora S. Marks

TL;DR

PRIMO is a transformer that combines in-context learning with test-time training to adapt rapidly to new proteins and assays when fitness labels are scarce. It encodes sequences, zero-shot scores, and sparse labels as a unified token set and is optimized with a preference-based ranking loss, enabling few-shot ranking of both substitution and indel variants. Pre-training across many deep mutational scanning (DMS) assays, followed by test-time adaptation, yields state-of-the-art performance in low-data regimes and across diverse protein families, as demonstrated on held-out ProteinGym assays and a natural-evolution benchmark. The work demonstrates the practical value of combining large-scale pre-training with lightweight, task-specific adaptation for data-efficient protein design.

Abstract

Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set in a pre-training masked-language modeling paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties (including both substitution and indel mutations), PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.

Paper Structure

This paper contains 30 sections, 5 equations, 1 figure, 20 tables.

Figures (1)

  • Figure 1: The PRIMO architecture and training approach. PRIMO processes labeled sets of proteins drawn from ProteinGym DMS assays. After processing the set with a transformer stack that allows for exchange of information between samples, it performs preference prediction on samples with masked fitness, and masked token prediction on amino acids.
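
To make the idea of preference prediction on samples with masked fitness more concrete, here is a minimal sketch of one common way to write a pairwise preference loss. It assumes a Bradley-Terry style logistic loss over all pairs of variants whose measured fitness differs. The exact loss used by PRIMO is not given in the text above, so the function name `pairwise_preference_loss` and its formulation are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def pairwise_preference_loss(scores, fitness):
    """Mean logistic loss over ordered pairs (i, j) with fitness[i] > fitness[j].

    Hypothetical helper for illustration only; not taken from the PRIMO paper.
    """
    scores = np.asarray(scores, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    # diff[i, j] = scores[i] - scores[j]
    diff = scores[:, None] - scores[None, :]
    # prefer[i, j] is True when variant i was measured as fitter than variant j
    prefer = fitness[:, None] > fitness[None, :]
    if not prefer.any():
        # No pair has a defined preference (e.g. all fitness values are equal)
        return 0.0
    # -log(sigmoid(diff)) == log(1 + exp(-diff)), computed in a stable way
    losses = np.logaddexp(0.0, -diff[prefer])
    return float(losses.mean())


# Toy example: three variants with measured fitness values
fitness = [0.1, 0.9, 0.5]
good = pairwise_preference_loss([-1.0, 2.0, 0.5], fitness)  # scores agree with the ranking
bad = pairwise_preference_loss([2.0, -1.0, 0.5], fitness)   # scores invert the ranking
print(good < bad)
```

A loss of this kind depends only on the relative order of the fitness values, not their scale, which is one reason ranking objectives are a natural fit when labels come from assays with different units and dynamic ranges.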