ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods

Michal Kmicikiewicz, Vincent Fortuin, Ewa Szczurek

TL;DR

ProSpero tackles the challenge of designing protein sequences that are both highly fit and novel by framing iterative design as inference-time guidance of a pre-trained generative model by a surrogate that is updated from oracle feedback. It introduces two key innovations: targeted masking, which focuses edits on fitness-relevant residues, and biologically-constrained Sequential Monte Carlo, which samples from a posterior that incorporates explicit biological priors while remaining robust to surrogate misspecification. In experiments on eight protein landscapes, ProSpero achieves strong fitness and novelty, preserves biological plausibility, and remains robust under distribution shifts and surrogate noise, often outperforming a wide range of baselines. The approach is data-efficient and enables broader exploration without sacrificing plausibility; code is publicly available for replication and extension.

Abstract

Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.
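To make the idea of surrogate-guided Sequential Monte Carlo over masked positions more concrete, here is a minimal sketch. It is not the paper's algorithm: the reward function `toy_reward`, the mask symbol `#`, the `allowed` residue set (a crude stand-in for a biological constraint), and the temperature `beta` are all assumptions made for illustration. Each particle fills one masked position at a time, is weighted by the change in reward, and the population is then resampled.

```python
import math
import random

def toy_reward(seq):
    """Hypothetical surrogate score: fraction of alanine ('A') residues."""
    return seq.count("A") / len(seq)

def smc_fill(masked_seq, allowed, n_particles=64, beta=20.0, seed=0):
    """Fill '#' positions with a simple resample-every-step SMC (illustrative only)."""
    rng = random.Random(seed)
    particles = [list(masked_seq) for _ in range(n_particles)]
    positions = [i for i, c in enumerate(masked_seq) if c == "#"]
    for pos in positions:
        weights = []
        for p in particles:
            before = toy_reward("".join(p))
            # Proposal: uniform over the allowed residues (assumed constraint).
            p[pos] = rng.choice(allowed)
            after = toy_reward("".join(p))
            # Incremental weight: exponentiated change in the toy reward.
            weights.append(math.exp(beta * (after - before)))
        # Multinomial resampling in proportion to the incremental weights.
        resampled = rng.choices(particles, weights=weights, k=n_particles)
        particles = [list(p) for p in resampled]
    return ["".join(p) for p in particles]

samples = smc_fill("MK##LV#G", allowed="AGST")
```

In this toy setting the resampling step concentrates particles on completions that the reward favours, while the proposal never leaves the allowed residue set. A real system would replace the uniform proposal with a pre-trained generative model and the toy reward with a learned surrogate.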

Paper Structure

This paper contains 62 sections, 2 equations, 4 figures, 19 tables, 4 algorithms.

Figures (4)

  • Figure 1: Overview of ProSpero. Each active learning iteration begins with training a surrogate model on the current dataset (A). The surrogate is then used to identify fitness-relevant residues within the top sequence (B), which are subsequently masked, yielding partially masked sequences (C). EvoDiff, guided by the surrogate, completes these sequences to generate new candidates (D), which are evaluated by the oracle and added to the dataset (E).
  • Figure 2: Maximum fitness recovered over $10$ active learning rounds. Only methods that improved over $x_{\text{start}}$ are shown. Shaded regions indicate standard deviation across 5 runs. ProSpero retrieves high-fitness sequences in earlier rounds than baselines on 4 out of 8 tasks.
  • Figure 3: Comparison of fitness-novelty trade-offs among leading methods. Each dot represents the outcome of a single run. ProSpero achieves both higher fitness and novelty than the baselines.
  • Figure 4: (A) Trade-offs between validity, fitness and novelty across leading methods; each dot represents the outcome of a single run. (B) Structural quality of top 100 sequences generated by ProSpero across 5 runs compared to $x_{\text{start}}$; average novelty of generated sequences is shown below each task. (C) Performance of leading methods on the AAV landscape under varying levels of surrogate noise. (D) Ablation of ProSpero components under the same setting as in (C). In both (C) and (D), shaded regions represent the standard deviation across 5 runs. ProSpero generates highly biologically plausible sequences and remains robust to surrogate misspecification.
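
The active learning loop described in the Figure 1 caption (steps A to E) can be sketched roughly as below. This is a toy illustration under strong assumptions, not the authors' implementation: the oracle, the tabular surrogate, the position-selection rule, and the random infilling that stands in for EvoDiff are all hypothetical, as are every function name and parameter.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_oracle(seq, target):
    """Hypothetical oracle: fraction of positions matching a hidden target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def fit_surrogate(dataset):
    """Hypothetical surrogate: mean observed fitness per (position, residue)."""
    sums, counts = {}, {}
    for seq, y in dataset:
        for i, aa in enumerate(seq):
            sums[(i, aa)] = sums.get((i, aa), 0.0) + y
            counts[(i, aa)] = counts.get((i, aa), 0) + 1
    mean_y = sum(y for _, y in dataset) / len(dataset)
    def score(i, aa):
        key = (i, aa)
        # Unseen (position, residue) pairs fall back to the dataset mean.
        return sums[key] / counts[key] if key in counts else mean_y
    return score

def predict(seq, score):
    """Surrogate prediction for a full sequence: mean per-position score."""
    return sum(score(i, aa) for i, aa in enumerate(seq)) / len(seq)

def select_positions(seq, score, k):
    """Assumed selection rule: mask the k positions with the lowest score."""
    return sorted(range(len(seq)), key=lambda i: score(i, seq[i]))[:k]

def fill_masked(seq, positions, rng):
    """Stand-in for the generative model: random residues at masked positions."""
    chars = list(seq)
    for i in positions:
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)

def run_active_learning(x_start, target, rounds=5, batch=8, k=3, seed=0):
    rng = random.Random(seed)
    dataset = [(x_start, toy_oracle(x_start, target))]
    for _ in range(rounds):
        score = fit_surrogate(dataset)                        # (A) train surrogate
        top_seq = max(dataset, key=lambda d: d[1])[0]         # current top sequence
        masked = select_positions(top_seq, score, k)          # (B), (C) choose and mask
        proposals = [fill_masked(top_seq, masked, rng)        # (D) complete sequences
                     for _ in range(4 * batch)]
        proposals.sort(key=lambda s: predict(s, score), reverse=True)
        for seq in proposals[:batch]:                         # (E) query oracle, extend data
            dataset.append((seq, toy_oracle(seq, target)))
    return dataset

dataset = run_active_learning("AAAAAAAAAA", "ACDEFGHIKL")
best_fitness = max(y for _, y in dataset)
```

Because every evaluated sequence is kept in the dataset, the best observed fitness can never fall below that of the starting sequence; how quickly it improves depends entirely on the quality of the surrogate and the generative proposal, which are deliberately trivial here.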