ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods
Michal Kmicikiewicz, Vincent Fortuin, Ewa Szczurek
TL;DR
ProSpero tackles the challenge of designing protein sequences that are both highly fit and novel by framing iterative design as inference-time guidance of a frozen pre-trained generative model, steered by a surrogate that is updated from oracle feedback. It introduces two key components: targeted masking, which focuses edits on fitness-relevant residues, and biologically-constrained Sequential Monte Carlo, which samples from a posterior that incorporates explicit biological priors while remaining robust to surrogate misspecification. In experiments on eight protein landscapes, ProSpero achieves strong fitness and novelty, preserves biological plausibility, and remains robust under distribution shift and surrogate noise, matching or outperforming a wide range of baselines. This data-efficient approach enables broader exploration without sacrificing plausibility, and the code is publicly available for replication and extension.
Abstract
Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.
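To make the loop described above concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a hypothetical oracle (`toy_oracle`, the fraction of positions matching a hidden target), a simple additive surrogate (`fit_surrogate`), and a uniform residue proposal standing in for the frozen pre-trained generative model. The masking rule in `select_mask` and the weighting in `smc_propose` are illustrative choices, and the explicit biological constraints are omitted because the abstract does not specify them.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_oracle(seq, target):
    """Hypothetical fitness assay: fraction of positions matching a hidden target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def fit_surrogate(labeled):
    """Toy additive surrogate: mean observed fitness per (position, residue) pair."""
    sums, counts = {}, {}
    for seq, y in labeled:
        for i, aa in enumerate(seq):
            sums[(i, aa)] = sums.get((i, aa), 0.0) + y
            counts[(i, aa)] = counts.get((i, aa), 0) + 1
    prior = sum(y for _, y in labeled) / len(labeled)  # fallback for unseen pairs
    return lambda i, aa: sums[(i, aa)] / counts[(i, aa)] if (i, aa) in counts else prior


def select_mask(seq, surrogate, k):
    """Assumed masking rule: the k positions whose current residue scores lowest."""
    return sorted(range(len(seq)), key=lambda i: surrogate(i, seq[i]))[:k]


def smc_propose(parent, mask, surrogate, rng, n_particles=32, temperature=0.1):
    """Fill masked positions one at a time: propose, weight by surrogate, resample."""
    particles = [list(parent) for _ in range(n_particles)]
    for pos in mask:
        for p in particles:
            p[pos] = rng.choice(ALPHABET)  # stand-in for the frozen generative model
        weights = [math.exp(surrogate(pos, p[pos]) / temperature) for p in particles]
        chosen = rng.choices(particles, weights=weights, k=n_particles)
        particles = [list(p) for p in chosen]  # copy so duplicates evolve independently
    return ["".join(p) for p in particles]


def active_learning(start, target, rounds=5, batch=8, k=3, seed=0):
    """Alternate surrogate fitting, guided sampling, and oracle queries."""
    rng = random.Random(seed)
    labeled = [(start, toy_oracle(start, target))]
    for _ in range(rounds):
        surrogate = fit_surrogate(labeled)
        parent = max(labeled, key=lambda item: item[1])[0]  # best sequence so far
        mask = select_mask(parent, surrogate, k)
        candidates = smc_propose(parent, mask, surrogate, rng)
        seen = {s for s, _ in labeled}
        fresh = [s for s in dict.fromkeys(candidates) if s not in seen][:batch]
        labeled += [(s, toy_oracle(s, target)) for s in fresh]  # oracle feedback
    return labeled
```

Calling `active_learning("AAAAAAAAAA", "ACDEFGHIKL")` returns every sequence queried so far with its toy fitness, beginning with the starting sequence. The point of the sketch is only the division of labor: the proposal distribution is never retrained, while the surrogate is refit after each round of oracle labels.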
