Steering Generative Models with Experimental Data for Protein Fitness Optimization

Jason Yang, Wenda Chu, Daniel Khalil, Raul Astudillo, Bruce J. Wittmann, Frances H. Arnold, Yisong Yue

TL;DR

The paper addresses efficient protein fitness optimization in enormous design spaces under low-throughput lab constraints by proposing SGPO, a general framework that steers modern generative models with small sets of labeled data. It systematically evaluates a spectrum of priors (discrete diffusion, continuous diffusion over embeddings, and autoregressive models) and guidance strategies (classifier guidance, DAPS, NOS) and demonstrates adaptive sequence selection via ensemble Thompson sampling. Key findings show that plug-and-play guidance with discrete diffusion models, particularly DAPS, outperforms RL-based finetuning and latent-space Bayesian optimization in low-data regimes, while Thompson sampling-based adaptation improves search efficiency and variant diversity across TrpB, CreiLOV, and GB1. The work provides practical guidance for real-world protein engineering, including a lightweight hyperparameter tuning approach and publicly available code to foster adoption and further validation.

Abstract

Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured through low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.
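The adaptive selection step mentioned above, "akin to Thompson sampling", can be illustrated with a toy sketch. This is an assumption-laden simplification, not the authors' implementation: here an ensemble of fitness predictors stands in for a posterior, one member is drawn at random per pick, and that member ranks a fixed candidate pool. In the paper the sampled model guides generation rather than ranking a pool. The names `thompson_select`, `candidates`, and `ensemble` are hypothetical.

```python
import random


def thompson_select(candidates, ensemble, batch_size, rng):
    """Pick distinct candidates, using one randomly drawn ensemble member per pick.

    Drawing a single ensemble member approximates drawing one plausible
    fitness function from a posterior (an assumption of this sketch).
    """
    remaining = list(candidates)
    chosen = []
    for _ in range(min(batch_size, len(remaining))):
        model = rng.choice(ensemble)       # approximate posterior sample
        best = max(remaining, key=model)   # act greedily under that sample
        chosen.append(best)
        remaining.remove(best)             # keep the batch free of repeats
    return chosen


# Toy ensemble: two hypothetical predictors that disagree about what is "fit".
toy_ensemble = [lambda s: s.count("A"), lambda s: s.count("G")]
toy_pool = ["AAA", "AAG", "GGG", "CCC"]
batch = thompson_select(toy_pool, toy_ensemble, batch_size=2, rng=random.Random(0))
```

Because different picks may use different ensemble members, the batch tends to cover several plausible optima instead of concentrating on one model's favorite, which is the intuition behind the diversity benefit reported in the TL;DR.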

Paper Structure

This paper contains 37 sections, 18 equations, 13 figures, 7 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison of steered generation for protein optimization (SGPO) to other ML-assisted workflows for protein engineering. (A) SGPO involves initializing a generative prior model to sample sequences with high natural likelihoods and steering that model with assay-labeled fitness data. Optimization is difficult because the design space is vast, and the throughput of wet-lab fitness assays (Erlenmeyer flask icon) is low, so adaptive learning across multiple iterations is beneficial. Previous methods have utilized generative models such as (B) fully zero-shot methods that sample highly natural sequences but do not utilize labeled fitness data or (C) those that only utilize labeled fitness. (D) Alternatively, supervised approaches involve enumerating to calculate fitness predictions for all variants in a design space, limiting them to optimizing few residues (i.e., $N < 9$).
  • Figure 2: Overview of different approaches to train diffusion models over discrete state spaces. During inference, a noised latent representation or sequence is decoded into a reasonable sequence (bottom track for each method). [X] refers to a masked token.
  • Figure 3: Methods design space for SGPO: a non-exhaustive landscape of generative models for protein sequences and methods to steer them with labeled data. Three major types of diffusion models for sequences include those that perform diffusion over continuous space and those that perform diffusion over discrete space with a uniform or absorbing state (masking) noising process. Various types of guidance strategies are compatible with certain models, in green (NOS: diffusion optimized sampling, SMC: sequential Monte Carlo, FUDGE: future discriminators for generation, PPLM: plug and play language models, DDPP: discrete denoising posterior prediction, RTB: relative trajectory balance, DPLM: diffusion protein language model, BO: Bayesian optimization). In contrast, language models and variational autoencoders can be aligned with labeled data via reinforcement learning, such as policy optimization, or via supervised finetuning.
  • Figure 4: Pretrained generative priors capture the target distribution of naturally occurring sequences that are homologous to TrpB (A-B) and CreiLOV (C-D), respectively. Lower perplexity corresponds to higher likelihood in the model. Sequence diversity was computed as the average Shannon entropy of mutated positions, and mean fitness corresponds to the oracle predictions. While the various models largely achieve comparable performance, the D3PM models capture the target distribution with the highest fidelity, whereas the UDLM model is prone to mode collapse. For each model, 1000 sequences were sampled and repeats were allowed to approximate the distribution. To approximate the target distribution, 1000 sequences were sampled from the MSA used for pretraining. Perplexity was calculated by passing generated sequences through the 764M-parameter ProGen2-base model. More details on model training can be found in Table \ref{table:prior_training} and Section \ref{section:model_methods}, and GB1 results are provided in Fig. \ref{fig:GB1_perplexity_logoplot}.
  • Figure 5: Pareto boundaries demonstrate the trade-off between generating sequences with high fitness and high diversity for TrpB (A-C), CreiLOV (D-F), and GB1 (G-I). Sequences sampled from the generative models (Continuous, D3PM, and MDLM), after guidance with labeled fitness data, are enriched in high-fitness protein variants, and most methods show higher performance than the ARLM+DPO baseline. Larger circles indicate a stronger guidance strength hyperparameter (excluding NOS), specified in Table \ref{table:guidance_parameters}. Each experiment was repeated using 10 different standardized sets of 200 unique sequences used for steering, each drawn from the D3PM prior, and error bars show standard deviation. Mean fitness and diversity were calculated based on 200 generated samples, with diversity calculated as the average Shannon entropy of amino acids at mutated positions. Unconditional refers to sequences sampled from the prior with no guidance.
  • ...and 8 more figures
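
The diversity metric named in the Figure 4 and Figure 5 captions (average Shannon entropy of amino acids at mutated positions) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the captions do not give the logarithm base (base 2 is assumed here), and "mutated positions" is taken to mean positions where at least one sampled sequence differs from a parent sequence. The function names are hypothetical.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits, an assumed unit) of the residues at one position."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mean_entropy_at_mutated_positions(sequences, parent):
    """Average per-site entropy over positions where any sequence differs from parent."""
    if any(len(s) != len(parent) for s in sequences):
        raise ValueError("all sequences must have the same length as the parent")
    mutated = [
        i for i in range(len(parent))
        if any(s[i] != parent[i] for s in sequences)
    ]
    if not mutated:
        return 0.0  # no mutations anywhere, so no diversity to measure
    total = sum(site_entropy([s[i] for s in sequences]) for i in mutated)
    return total / len(mutated)


# Toy example: positions 1 and 3 are each split evenly between two residues.
toy_parent = "ACDE"
toy_samples = ["ACDE", "AGDE", "ACDF", "AGDF"]
toy_diversity = mean_entropy_at_mutated_positions(toy_samples, toy_parent)
```

In the toy example each mutated position is an even two-way split, so each contributes 1 bit and the average is 1 bit. Restricting the average to mutated positions keeps long conserved regions from diluting the score.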