A Variational Perspective on Generative Protein Fitness Optimization

Lea Bogensperger, Dominik Narnhofer, Ahmed Allam, Konrad Schindler, Michael Krauthammer

TL;DR

Protein fitness optimization is framed as sampling from the posterior $p(x|y)$ over a vast, discrete sequence space, where $y=f(x)$ is the target fitness. The paper introduces Variational Latent Generative Protein Optimization (VLGPO), embedding sequences into a continuous latent space with a flow-matching prior and guiding sampling via classifier guidance using $\nabla_x \log p(y|x)$ to reach high-fitness regions. The approach combines a latent variational autoencoder ($\mathcal{E},\mathcal{D}$) with latent-space flow modeling and manifold-constrained gradients, achieving state-of-the-art results on GFP and AAV benchmarks under limited data regimes and offering a flexible plug-and-play framework. While modular and effective, the method relies on hyperparameter tuning and in-silico evaluation with an external oracle, suggesting future work on pretrained embeddings, broader benchmarks, and experimental validation.
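The guidance term in this summary follows from Bayes' rule. A short derivation is given here in the notation above; note that the gradient with respect to a discrete sequence $x$ is only formal, which is why the method moves to a continuous latent space, where the same identity is assumed to hold for the latent variable.

```latex
\begin{align}
p(x \mid y) &= \frac{p(y \mid x)\, p(x)}{p(y)}, \\
\log p(x \mid y) &= \log p(x) + \log p(y \mid x) - \log p(y), \\
\nabla_x \log p(x \mid y) &= \nabla_x \log p(x) + \nabla_x \log p(y \mid x).
\end{align}
```

The last line drops $\log p(y)$ because it does not depend on $x$. The first term on the right is supplied by the generative prior and the second by the fitness predictor, which is what allows the two components to be swapped independently.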

Abstract

The goal of protein fitness optimization is to discover new protein variants with enhanced fitness for a given use. The vast search space and the sparsely populated fitness landscape, along with the discrete nature of protein sequences, pose significant challenges when trying to determine the gradient towards configurations with higher fitness. We introduce Variational Latent Generative Protein Optimization (VLGPO), a variational perspective on fitness optimization. Our method embeds protein sequences in a continuous latent space to enable efficient sampling from the fitness distribution and combines a (learned) flow matching prior over sequence mutations with a fitness predictor to guide optimization towards sequences with high fitness. VLGPO achieves state-of-the-art results on two different protein benchmarks of varying complexity. Moreover, the variational design with explicit prior and likelihood functions offers a flexible plug-and-play framework that can be easily customized to suit various protein design tasks.
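To make the combination of a flow prior and a fitness predictor concrete, here is a minimal toy sketch in one dimension. It is not the paper's algorithm: the learned flow is replaced by the closed-form velocity field of a Gaussian prior, the fitness predictor is a hypothetical identity function, and the guidance gradient is evaluated at a one-step estimate of the final latent, which is one common way to keep the gradient near the data distribution. All names and constants (`velocity`, `predictor`, `guidance`, `alpha`, `MU`, `S`, `TAU`, `Y_TARGET`) are assumptions made for illustration.

```python
import random

MU, S = 1.0, 0.5           # toy latent "data" prior N(MU, S^2) (assumption)
Y_TARGET, TAU = 2.0, 0.5   # toy likelihood p(y|z) = N(y; predictor(z), TAU^2)


def velocity(z, t):
    """Exact marginal velocity for the linear path z_t = (1-t) z_0 + t z_1,
    with z_0 ~ N(0, 1) and z_1 ~ N(MU, S^2) drawn independently."""
    var_t = (1 - t) ** 2 + (t * S) ** 2
    return MU + (t * S * S - (1 - t)) / var_t * (z - t * MU)


def predictor(z):
    """Hypothetical fitness predictor; the identity keeps the toy analytic."""
    return z


def guidance(z, t):
    """Gradient w.r.t. z_t of log p(Y_TARGET | z1_hat), where
    z1_hat = z_t + (1 - t) * velocity(z_t, t) is a one-step endpoint estimate."""
    var_t = (1 - t) ** 2 + (t * S) ** 2
    z1_hat = z + (1 - t) * velocity(z, t)
    dz1_dz = t * S * S / var_t  # derivative of z1_hat w.r.t. z_t in this toy
    # predictor'(z) = 1 for the identity predictor, so no extra factor is needed.
    return (Y_TARGET - predictor(z1_hat)) / TAU ** 2 * dz1_dz


def sample(alpha, n=2000, steps=20, seed=0):
    """Euler integration of the flow ODE with guidance weight alpha."""
    rng = random.Random(seed)
    h = 1.0 / steps
    samples = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # start from standard Gaussian noise
        for k in range(steps):
            t = k * h
            z += h * (velocity(z, t) + alpha * guidance(z, t))
        samples.append(z)
    return samples


unguided = sample(alpha=0.0)
guided = sample(alpha=1.0)
mean_unguided = sum(unguided) / len(unguided)
mean_guided = sum(guided) / len(guided)
print(f"unguided mean: {mean_unguided:.3f}, guided mean: {mean_guided:.3f}")
```

With `alpha=0.0` the samples follow the prior, so their mean stays close to `MU`; with a positive `alpha` the likelihood gradient shifts them toward `Y_TARGET`. In the actual method the prior and predictor are learned networks acting on encoded sequences, so this sketch only illustrates the additive structure of the guided update.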

Paper Structure

This paper contains 23 sections, 9 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of sampling. The central section illustrates the framework, showcasing protein sequences, their latent representations $z$, and the approximate posterior distribution. The upper section depicts unconditional sampling from the prior $\mathrm{p}(x)$ using flow matching in the latent space, while the lower section illustrates the modifications introduced by VLGPO during sampling. We additionally incorporate a likelihood term $\mathrm{p}(y|x)$ to condition on the fitness $y$, enabling sequence generation from the posterior distribution $\mathrm{p}(x|y)$ and facilitating sampling from high-fitness regions (as shown by the shifted and reshaped distribution).
  • Figure 2: Schematic depiction of classifier guidance, with $J=1$ and $K=6$. Grey lines represent the latent manifolds at different time steps $t$; the blue line marks the trajectory of the maximum likelihood. Solid arrows indicate how the latent evolves over time. Left: Naive guidance with likelihood gradients $\nabla_{z_t}$ computed directly at $z_t$ pushes the sample off the manifold. This error accumulates, as indicated by the purple regions. Right: Guidance with manifold constraint, as employed in Algorithm 1, converges to a valid sequence with fitness $y$. Solid arrows again denote the evolution of the latent; dashed arrows indicate the flow posterior sampling scheme that ensures the latent stays on the manifold when applying the likelihood gradient.
  • Figure 3: Grid search for median fitness depending on sampling parameters $\alpha_t$ and $J$ for the different tasks using the predictor $g_{\phi}$. In general, higher values of $\alpha_t$ and $J$, corresponding to strong classifier guidance, yield higher predicted fitness values.
  • Figure 4: Comparing evaluated fitness $y_{\mathrm{gt}}$ from the oracle $g_\psi$ with required fitness $y$ using the directly learned posterior model (in the same latent space) and our variational approach (VLGPO).
  • Figure 5: AAV medium
  • ...and 2 more figures