Robust Model-Based Optimization for Challenging Fitness Landscapes

Saba Ghaffari, Ehsan Saleh, Alexander G. Schwing, Yu-Xiong Wang, Martin D. Burke, Saurabh Sinha

TL;DR

This work tackles the robustness gap in model-based optimization for protein design caused by sparse high-fitness samples and the separation between well-represented low-fitness regions and distant high-fitness optima. It introduces the Property-Prioritized Generative Variational Auto-Encoder (PPGVAE), which reshapes the latent space via a relationship loss and a temperature parameter to prioritize high-fitness samples without weighting, achieving $N_{\text{eff}} = K$. The method demonstrates superior robustness across discrete and continuous design spaces through benchmarks on GMM, real protein datasets (AAV, GB1, PhoQ), semi-synthetic landscapes, and PINN-based Poisson equation solutions, often converging faster and with fewer MBO steps. The results suggest broad applicability and practical impact for efficient design under challenging data distributions and rugged fitness landscapes.

Abstract

Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been studied in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets, as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.

Paper Structure

This paper contains 21 sections, 8 equations, 20 figures, and 3 tables.

Figures (20)

  • Figure 1: Challenges of imbalance and separation in fitness landscapes. Each plot shows a sequence space (x-y plane) and fitness landscape (red-white-blue gradient), along with the training data composition (white circles and stars). (A-C, left to right) In each of these hypothetical scenarios, the sparsity of high-fitness training samples (white stars) relative to low-fitness samples (white circles), also called "imbalance", presents a challenge for MBO. Moreover, panel C shows a greater degree of separation between low- and high-fitness samples than panels A and B, presenting a significant additional challenge for MBO, above and beyond that due to imbalance. The rightmost panel is a schematic representation of a real-world dataset of enzyme variants designed for an unnatural substrate (xyz) distinct from the substrate of the wild-type enzyme (xyz). The dataset comprises a few non-zero fitness variants (stars) that are far from the bulk of the training samples, which have zero fitness (white circles). Hypothetical peaks have been drawn at the rare non-zero fitness variants to illustrate that the fitness landscape presents the twin challenges of imbalance and separation, similar to those in panel C.
  • Figure 2: Latent space of our PPGVAE vs Vanilla VAE. PPGVAE and vanilla VAE were trained on a toy MNIST-derived dataset where property values decrease monotonically with digit value (zero has highest property value). Vanilla VAE (Left) scatters the rare samples of digit zero (blue) and samples of next-highest property value (digit one, orange) in the latent space, whereas PPGVAE (Middle and Right) maps digits with higher property values closer to the origin. This results in the classes with greatest property values having higher probability of generation. PPGVAE was run in two modes, where the relationship loss was enforced in a strong (Middle) or soft (Right) manner (see text).
  • Figure 3: Robustness to imbalance and separation in MBO for GMM. A bimodal GMM is used as the property oracle (top Left), i.e., the fitness ($Y$) landscape on a one-dimensional sequence space ($X$). Separation is defined as the distance between the means of the two modes ($\Delta \mu$). Higher values of $\Delta \mu$ are associated with higher separation. Train sets were generated by taking $N$ samples from the less desired mode $\mu_1$ and $\rho N$ (imbalance ratio $\rho \leq 1$) samples from the more desired mode $\mu_2$. For a fixed separation, PPGVAE achieves robust relative improvement of the highest property sample generated ($\Delta Y_{\max}$), regardless of the imbalance ratio (Bottom panels). Performance of PPGVAE, aggregated over all imbalance ratios, stays robust to increasing separation (top Right). PPGVAE converges in fewer MBO steps (top Middle).
  • Figure 4: Robustness to imbalance and separation in MBO for AAV dataset. PCA plot for protein sequences in the dataset, colored with their property values (top Left). Blue and red color spectrum are used for less and more desired samples, respectively. Top middle and right panels show train sets with low and high separation, respectively, between the abundant less-desired and rare more-desired samples. PPGVAE achieves robust relative improvements (shown here for the low separation scenario), regardless of the imbalance ratio $\rho$ (bottom Middle). Its performance also stays robust to increasing separation (bottom Right). PPGVAE performance is only slightly affected by reducing its sampling budget per MBO step ($N_s$) (bottom Left).
  • Figure 5: Robustness to imbalance and separation in MBO for GB1 dataset. The tSNE plot for the appended sequences of the semi-synthetic GB1 dataset (top Left). The bottom left panel shows an example train set with low separation between less and more desired samples, i.e., an appended sequence of length three (see Figure \ref{fig:app_gb_lowhigh} for an example of high separation). For a fixed separation level, PPGVAE provides robust improvements, regardless of the imbalance ratio (top Middle). It is also robust to the degree of separation, measured by aggregated performance over all imbalance ratios (top Right). PPGVAE has faster convergence (bottom Right) and achieves similar improvements with a smaller sampling budget per MBO step ($N_s$) (bottom Middle).
  • ...and 15 more figures
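
The train-set construction described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `gmm_oracle` and `make_train_set`, the mode weights `w1` and `w2`, the shared width `sigma`, and the rounding of $\rho N$ to an integer are all assumptions made here for concreteness. Only the sampling scheme ($N$ samples around $\mu_1$, $\rho N$ samples around $\mu_2 = \mu_1 + \Delta \mu$) comes from the caption.

```python
import math
import random


def gmm_oracle(x, mu1, mu2, sigma=1.0, w1=0.5, w2=1.0):
    """Hypothetical bimodal fitness: two Gaussian bumps, the taller one at mu2.

    The weights and the shared width are illustrative assumptions.
    """
    def bump(mu):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

    return w1 * bump(mu1) + w2 * bump(mu2)


def make_train_set(n, rho, mu1, delta_mu, sigma=1.0, seed=0):
    """Draw n points near the less desired mode and about rho * n near the other.

    delta_mu plays the role of the separation, rho the imbalance ratio.
    """
    rng = random.Random(seed)
    mu2 = mu1 + delta_mu
    n_rare = int(round(rho * n))  # assumed rounding rule for rho * N
    xs = [rng.gauss(mu1, sigma) for _ in range(n)]
    xs += [rng.gauss(mu2, sigma) for _ in range(n_rare)]
    ys = [gmm_oracle(x, mu1, mu2, sigma) for x in xs]
    return xs, ys


# Example: 1000 abundant samples, imbalance ratio 0.05, separation 10.
xs, ys = make_train_set(n=1000, rho=0.05, mu1=0.0, delta_mu=10.0)
print(len(xs), max(ys))
```

Sweeping `rho` at fixed `delta_mu`, or `delta_mu` at fixed `rho`, would give train sets that vary imbalance and separation independently, which is the kind of grid the caption describes.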