Steering Generative Models with Experimental Data for Protein Fitness Optimization
Jason Yang, Wenda Chu, Daniel Khalil, Raul Astudillo, Bruce J. Wittmann, Frances H. Arnold, Yisong Yue
TL;DR
The paper addresses protein fitness optimization in enormous design spaces under low-throughput wet-lab constraints by proposing SGPO, a general framework that steers generative models with small sets of labeled data. It systematically evaluates a range of priors (discrete diffusion, continuous diffusion over embeddings, and autoregressive models) and guidance strategies (classifier guidance, DAPS, NOS), and demonstrates adaptive sequence selection via ensemble Thompson sampling. Plug-and-play guidance with discrete diffusion models, particularly DAPS, outperforms RL-based finetuning and latent-space Bayesian optimization in low-data regimes, and Thompson sampling-based adaptation improves search efficiency and variant diversity across TrpB, CreiLOV, and GB1. The work also offers practical guidance for real-world protein engineering, including a lightweight hyperparameter tuning approach and publicly available code.
Abstract
Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured through low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.
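To make the idea of adaptive sequence selection "akin to Thompson sampling" more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a fixed pool of candidate sequences rather than guided generation, uses bootstrapped ridge regressors on one-hot features as a stand-in for the fitness-model ensemble, and all function names (`one_hot`, `fit_ensemble`, `thompson_select`) are hypothetical. For each slot in a batch, one ensemble member is drawn at random and the candidate it scores highest is selected, so disagreement between members translates into exploration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seqs):
    """Encode equal-length sequences as flat one-hot vectors."""
    idx = {a: i for i, a in enumerate(AMINO_ACIDS)}
    X = np.zeros((len(seqs), len(seqs[0]) * len(AMINO_ACIDS)))
    for n, s in enumerate(seqs):
        for p, a in enumerate(s):
            X[n, p * len(AMINO_ACIDS) + idx[a]] = 1.0
    return X

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression weights (assumed surrogate model)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def fit_ensemble(X, y, n_models=10, rng=None):
    """Fit each ensemble member on a bootstrap resample of the labeled data."""
    rng = np.random.default_rng() if rng is None else rng
    models = []
    for _ in range(n_models):
        boot = rng.integers(0, len(y), size=len(y))
        models.append(fit_ridge(X[boot], y[boot]))
    return models

def thompson_select(models, X_cand, batch_size, rng=None):
    """Pick a batch: for each slot, sample one ensemble member and take
    its top-scoring candidate that has not been chosen yet.
    Requires batch_size <= number of candidates."""
    rng = np.random.default_rng() if rng is None else rng
    chosen = []
    available = np.ones(len(X_cand), dtype=bool)
    for _ in range(batch_size):
        w = models[rng.integers(len(models))]  # one posterior-like draw
        scores = X_cand @ w
        scores[~available] = -np.inf           # mask already-chosen candidates
        best = int(np.argmax(scores))
        chosen.append(best)
        available[best] = False
    return chosen
```

In an optimization campaign, the selected sequences would be measured in the lab, appended to the labeled set, and the ensemble refit before the next round. In a guided-generation setting, the sampled ensemble member would instead steer the generative model's sampling rather than rank a fixed pool.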
