ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods

Michal Kmicikiewicz; Vincent Fortuin; Ewa Szczurek

TL;DR

ProSpero tackles the challenge of designing protein sequences that are both highly fit and novel by framing iterative design as inference-time guidance of a frozen pre-trained generative model, steered by a surrogate that is updated from oracle feedback. It introduces two key innovations: targeted masking, which focuses edits on fitness-relevant residues, and biologically-constrained Sequential Monte Carlo, which samples from a posterior that incorporates explicit biological priors while remaining robust to surrogate misspecification. In experiments on eight protein landscapes, ProSpero achieves strong fitness and novelty, preserves biological plausibility, and remains robust under distribution shift and surrogate noise, often outperforming a wide range of baselines. This data-efficient approach lets protein engineers explore more broadly without sacrificing plausibility. Code is publicly available for replication and extension.

Abstract

Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.
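
The loop described above (fit a surrogate to oracle feedback, propose candidates by filling masked positions, reweight them with the surrogate, query the oracle) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`toy_oracle`, `fit_surrogate`, `smc_propose`, `active_learning`) is hypothetical, a uniform residue proposal stands in for the frozen pre-trained generative model, the mask is chosen at random rather than by fitness relevance, and the biological constraints are omitted entirely.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_oracle(seq, target):
    """Hypothetical stand-in for an assay: fraction of positions matching a hidden target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def fit_surrogate(data):
    """Toy surrogate: mean observed fitness for each (position, residue) pair."""
    sums, counts = {}, {}
    for seq, y in data:
        for i, aa in enumerate(seq):
            sums[(i, aa)] = sums.get((i, aa), 0.0) + y
            counts[(i, aa)] = counts.get((i, aa), 0) + 1
    mean_y = sum(y for _, y in data) / len(data)
    table = {k: sums[k] / counts[k] for k in sums}

    def predict(seq):
        # Unseen (position, residue) pairs fall back to the global mean.
        return sum(table.get((i, aa), mean_y) for i, aa in enumerate(seq)) / len(seq)

    return predict


def smc_propose(seed_seq, mask, predict, rng, n_particles=32, temperature=0.05):
    """Fill masked positions one at a time, then reweight and resample by surrogate score."""
    particles = [list(seed_seq) for _ in range(n_particles)]
    for pos in mask:
        for p in particles:
            # Assumption: uniform proposal in place of the generative model.
            p[pos] = rng.choice(ALPHABET)
        scores = [predict("".join(p)) for p in particles]
        top = max(scores)
        weights = [math.exp((s - top) / temperature) for s in scores]
        resampled = rng.choices(particles, weights=weights, k=n_particles)
        particles = [list(p) for p in resampled]  # copy so particles stay independent
    return ["".join(p) for p in particles]


def active_learning(wild_type, target, rounds=5, batch=8, n_mask=3, seed=0):
    """Run a few rounds of surrogate fitting, proposal, and oracle queries."""
    rng = random.Random(seed)
    data = [(wild_type, toy_oracle(wild_type, target))]
    for _ in range(rounds):
        predict = fit_surrogate(data)
        best_seq, _ = max(data, key=lambda d: d[1])
        # Assumption: random mask; the paper instead targets fitness-relevant residues.
        mask = sorted(rng.sample(range(len(best_seq)), n_mask))
        candidates = smc_propose(best_seq, mask, predict, rng)
        seen = {s for s, _ in data}
        novel = [c for c in dict.fromkeys(candidates) if c not in seen]
        chosen = sorted(novel, key=predict, reverse=True)[:batch]
        data.extend((c, toy_oracle(c, target)) for c in chosen)  # oracle feedback
    return data
```

Calling `active_learning("ACDEFGHIKL", "ACDEFGHIKW", rounds=3)` returns the list of `(sequence, fitness)` pairs queried so far, starting with the wild type. The resampling step is what distinguishes this from plain rejection of low-scoring candidates: particles with higher surrogate scores are duplicated before the next masked position is filled, so later positions are explored mostly in promising contexts.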
