A Variational Perspective on Generative Protein Fitness Optimization
Lea Bogensperger, Dominik Narnhofer, Ahmed Allam, Konrad Schindler, Michael Krauthammer
TL;DR
Protein fitness optimization is framed as sampling from the posterior $p(x|y)$ over a vast, discrete sequence space, where $y=f(x)$ is the target fitness. The paper introduces Variational Latent Generative Protein Optimization (VLGPO), which embeds sequences into a continuous latent space, learns a flow-matching prior there, and steers sampling toward high-fitness regions with classifier guidance based on $\nabla_x \log p(y|x)$. The approach combines a variational autoencoder (encoder $\mathcal{E}$, decoder $\mathcal{D}$) with latent-space flow modeling and manifold-constrained gradients, achieving state-of-the-art results on the GFP and AAV benchmarks in limited-data regimes and offering a flexible plug-and-play framework. Although modular and effective, the method relies on hyperparameter tuning and on in silico evaluation with an external oracle, suggesting future work on pretrained embeddings, broader benchmarks, and experimental validation.
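The role of the guidance gradient in this summary follows from Bayes' rule. As a short worked step, using only the symbols $x$ and $y$ introduced above:

$$
p(x|y) = \frac{p(y|x)\,p(x)}{p(y)}
\quad\Longrightarrow\quad
\nabla_x \log p(x|y) = \nabla_x \log p(x) + \nabla_x \log p(y|x).
$$

The normalizer $p(y)$ does not depend on $x$, so it drops out of the gradient. The first term is the contribution of the generative prior and the second is the contribution of the fitness predictor, which is why the two components can be trained separately and combined at sampling time.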
Abstract
The goal of protein fitness optimization is to discover new protein variants with enhanced fitness for a given use. The vast search space and the sparsely populated fitness landscape, along with the discrete nature of protein sequences, pose significant challenges when trying to determine the gradient towards configurations with higher fitness. We introduce Variational Latent Generative Protein Optimization (VLGPO), a variational perspective on fitness optimization. Our method embeds protein sequences in a continuous latent space to enable efficient sampling from the fitness distribution and combines a (learned) flow matching prior over sequence mutations with a fitness predictor to guide optimization towards sequences with high fitness. VLGPO achieves state-of-the-art results on two different protein benchmarks of varying complexity. Moreover, the variational design with explicit prior and likelihood functions offers a flexible plug-and-play framework that can be easily customized to suit various protein design tasks.
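To make the combination of a flow prior and a fitness predictor concrete, the following is a minimal toy sketch in Python and not the authors' implementation. It assumes a 2-D latent space, a closed-form flow-matching velocity that transports $N(0, I)$ to a Gaussian prior $N(\mu, I)$ in place of a learned network, a linear fitness predictor, and a Gaussian likelihood whose gradient is simply added to the prior velocity. All function names and parameters (`prior_velocity`, `predict_fitness`, `scale`, `sigma`) are hypothetical, and the manifold-constrained gradient and the decoder step are omitted.

```python
import numpy as np

def prior_velocity(z, t, mu):
    """Closed-form flow-matching velocity for a toy prior, N(0, I) -> N(mu, I).

    Stands in for a learned velocity network (assumption for illustration).
    """
    var_t = (1.0 - t) ** 2 + t ** 2  # variance of the interpolant at time t
    return mu + (2.0 * t - 1.0) / var_t * (z - t * mu)

def predict_fitness(z, w):
    """Hypothetical linear fitness predictor acting on latent codes."""
    return z @ w

def grad_log_likelihood(z, y, w, sigma):
    """Gradient of log p(y|z) = -(w.z - y)^2 / (2 sigma^2) + const w.r.t. z."""
    residual = predict_fitness(z, w) - y
    return -(residual[:, None] / sigma ** 2) * w[None, :]

def sample(n_samples, y, w, mu, scale, sigma=0.5, n_steps=100, seed=0):
    """Euler integration of the prior velocity plus an additive guidance term."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, mu.shape[0]))  # start from N(0, I)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        drift = prior_velocity(z, t, mu)
        if scale > 0:
            # Simplified guidance: push latents toward the target fitness y.
            drift = drift + scale * grad_log_likelihood(z, y, w, sigma)
        z = z + dt * drift
    return z

mu = np.zeros(2)
w = np.array([1.0, 0.0])
y_target = 2.0

unguided = sample(512, y_target, w, mu, scale=0.0)
guided = sample(512, y_target, w, mu, scale=1.0)
print("mean predicted fitness, unguided:", predict_fitness(unguided, w).mean())
print("mean predicted fitness, guided:  ", predict_fitness(guided, w).mean())
```

In this toy setting the guided samples should have a mean predicted fitness much closer to `y_target` than the unguided ones. Swapping `prior_velocity` or `grad_log_likelihood` for a different model leaves the sampling loop unchanged, which illustrates the plug-and-play character of an explicit prior and likelihood.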
