Low-N Protein Activity Optimization with FolDE

Jacob B. Roberts, Catherine R. Ji, Isaac Donnell, Thomas D. Young, Allison N. Pearson, Graham A. Hudson, Leah S. Keiser, Mia Wesselkamper, Peter H. Winegar, Janik Ludwig, Sarah H. Klass, Isha V. Sheth, Ezechinyere C. Ukabiala, Maria C. T. Astolfi, Benjamin Eysenbach, Jay D. Keasling

TL;DR

Low-N protein optimization suffers from biased training data and too little exploration when each round selects only the top-ranked mutants. FolDE addresses this with a naturalness warm-start and diversity-aware batch selection (constant-liar), combining PLM embeddings with ranking-based neural learning to iteratively refine protein activity predictions. In ProteinGym-based simulations across 20 targets, FolDE discovers 23% more top-10% mutants than the best baseline and is 55% more likely to find a top-1% mutant, outperforming random, zero-shot, and EVOLVEpro baselines; the open-source Foldy software makes the workflow broadly available. The results suggest that integrating naturalness priors, robust ranking, and batch diversity into ALDE workflows can improve the efficiency and accuracy of low-N protein engineering, with implications for foundation-model-driven design in biology.

Abstract

Proteins are traditionally optimized through the costly construction and measurement of many mutants. Active Learning-assisted Directed Evolution (ALDE) alleviates that cost by predicting the best improvements and iteratively testing mutants to inform predictions. However, existing ALDE methods face a critical limitation: selecting the highest-predicted mutants in each round yields homogeneous training data insufficient for accurate prediction models in subsequent rounds. Here we present FolDE, an ALDE method designed to maximize end-of-campaign success. In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants. FolDE achieves this primarily through naturalness-based warm-starting, which augments limited activity measurements with protein language model outputs to improve activity prediction. We also introduce a constant-liar batch selector, which improves batch diversity; this is important in multi-mutation campaigns but had limited effect in our benchmarks. The complete workflow is freely available as open-source software, making efficient protein optimization accessible to any laboratory.

Paper Structure

This paper contains 20 sections, 4 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: The FolDE Workflow (a) schematic of the FolDE workflow, starting with the zero-shot prediction before data has been collected, followed by few-shot prediction and Bayesian batch building, in a design-build-test-learn cycle. (b) Performance on the single-mutation and (c) multi-mutation benchmark for FolDE vs three baselines: random selection, zero-shot naturalness-based selection, and a random forest with embeddings (representing EVOLVEpro, Jiang2025-pb). Metrics shown are the cumulative top 10% mutants discovered (top) and probability of finding a top 1% mutant (bottom).
  • Figure 2: The Apparent Tension Between Round-1 and Round-2 Explore and Exploit (a) The simple workflow under study, notably excluding some FolDE features like ensembling and constant-liar. We study the interplay between three features: choice of top-layer (random forest or a neural network), round-1 selection approach (random selection or naturalness zero-shot selection), and the inclusion of naturalness warm-start when training. (b) training benchmark experiment results: the Spearman correlation of the trained model on a held-out set of mutants for three rounds of simulation for a random forest top-layer and (c) for a neural network top-layer. (d) Performance of the FolDE workflow without the warm-start feature enabled, measured on the single-mutation benchmark and (e) multi-mutation benchmark.
  • Figure 3: Constant-Liar Improves Batch Diversity (a) Schematic of the constant-liar algorithm. After selecting a high-performing mutant from the pool, the algorithm pessimistically assumes that the mutant performs poorly (the "lie") and propagates this assumption through the prediction ensemble's covariance structure. The alpha parameter controls the balance between exploitation and exploration, with lower values creating more diverse batches. (b) Batch diversity for the single-mutation and (c) multi-mutation datasets with constant-liar applied for six rounds. Bars indicate the number of unique loci sampled per batch, with darker colors showing newly explored loci. (d) Model predictions are more accurate with more aggressive constant-liar on the single-mutation and (e) multi-mutation benchmarks. (f) Medium constant-liar ($\alpha=6$) applied in round 2 slightly improves the probability of finding a top 1% mutant in the single-mutation benchmark. (g) Medium constant-liar ($\alpha=6$) in round 2 has little effect on the probability of finding a top 1% mutant in the multi-mutation benchmark.
  • Figure 4: FolDE Ablation (a) The relative contributions of the major workflow components were evaluated by the number of top 10% mutants discovered after a three-round campaign on the single-mutation benchmark and (b) the multi-mutation benchmark. (c, d) Probability of finding a top 1% mutant after three rounds.
  • Figure S1: Top Layer Architectures Prediction quality of three top-layer architectures: a random forest, a neural network trained with mean squared error loss, and a neural network trained with ranking loss. All were evaluated on the training benchmark, with mutants selected at random in round 1.
  • ...and 5 more figures
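
The constant-liar idea described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the FolDE implementation: it assumes the ensemble's predictions are summarized by an empirical mean and covariance, and that the "lie" is propagated with a standard Gaussian conditioning update. The function name `constant_liar_batch` and the parameter `lie_std` (how many standard deviations below the mean the lie sits) are hypothetical; in particular, `lie_std` is not the paper's α, whose exact definition is not given in this excerpt.

```python
import numpy as np

def constant_liar_batch(ensemble_preds, batch_size, lie_std=2.0, jitter=1e-9):
    """Greedy batch selection with a pessimistic "lie" after each pick.

    ensemble_preds: array of shape (n_models, n_candidates), n_models >= 2.
    lie_std: hypothetical knob (NOT the paper's alpha); the lie is placed
             this many standard deviations below the predicted mean.
    """
    preds = np.asarray(ensemble_preds, dtype=float)
    mu = preds.mean(axis=0)                            # ensemble mean per candidate
    cov = np.atleast_2d(np.cov(preds, rowvar=False))   # candidate-by-candidate covariance
    available = np.ones(mu.size, dtype=bool)
    chosen = []
    for _ in range(min(batch_size, mu.size)):
        # Pick the best remaining candidate under the current (lied-to) mean.
        i = int(np.argmax(np.where(available, mu, -np.inf)))
        chosen.append(i)
        available[i] = False
        # The lie: pretend candidate i was measured and performed poorly.
        var_i = max(cov[i, i], 0.0) + jitter
        lie = mu[i] - lie_std * np.sqrt(var_i)
        # Gaussian conditioning (assumed update rule): candidates that
        # covary with i have their predicted means pulled down as well.
        gain = cov[:, i] / var_i
        mu = mu + gain * (lie - mu[i])
        cov = cov - np.outer(gain, cov[i, :])
    return chosen

# Toy ensemble: candidates 0 and 1 are perfectly correlated, candidate 2 is
# nearly independent with a lower predicted mean.
demo_preds = np.array([
    [1.0, 0.9, 0.7],
    [2.0, 1.9, 0.9],
    [3.0, 2.9, 0.7],
    [4.0, 3.9, 0.9],
])
print(constant_liar_batch(demo_preds, 2, lie_std=0.0))  # plain greedy: [0, 1]
print(constant_liar_batch(demo_preds, 2, lie_std=2.0))  # with the lie: [0, 2]
```

In the toy example, setting `lie_std=0.0` reduces the procedure to plain top-k selection, which picks the two correlated candidates. With a nonzero lie, the second pick shifts to the less correlated candidate, which is the batch-diversity effect the caption describes.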