Robust Model-Based Optimization for Challenging Fitness Landscapes
Saba Ghaffari, Ehsan Saleh, Alexander G. Schwing, Yu-Xiong Wang, Martin D. Burke, Saurabh Sinha
TL;DR
This work tackles the robustness gap in model-based optimization (MBO) for protein design caused by sparse high-fitness samples and the separation between well-represented low-fitness regions and distant high-fitness optima. It introduces the Property-Prioritized Generative Variational Auto-Encoder (PPGVAE), which reshapes the latent space via a relationship loss and a temperature parameter to prioritize high-fitness samples without sample weighting, keeping the effective sample size at $N_{\text{eff}} = K$. The method demonstrates superior robustness across discrete and continuous design spaces through benchmarks on GMM, real protein datasets (AAV, GB1, PhoQ), semi-synthetic landscapes, and PINN-based Poisson equation solutions, often converging in fewer MBO steps. The results suggest broad applicability and practical impact for efficient design under challenging data distributions and rugged fitness landscapes.
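To make the $N_{\text{eff}} = K$ claim concrete, the following is a minimal sketch of how sample weighting can shrink the effective sample size. It assumes that $K$ denotes the number of training samples and uses Kish's formula, $N_{\text{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$; this section does not state the authors' exact definition. The function name `effective_sample_size`, the toy fitness values, and the temperature of 0.1 are illustrative assumptions, not taken from the authors' implementation.

```python
import numpy as np


def effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2).

    Assumed definition for illustration; the paper may define N_eff differently.
    """
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()


K = 8  # hypothetical number of training samples

# No weighting: every sample counts equally, so N_eff equals K.
uniform = np.ones(K)

# Fitness-based weighting (hypothetical exponential scheme): a few
# high-fitness samples dominate, so N_eff falls below K.
fitness = np.linspace(0.0, 1.0, K)
temperature = 0.1  # illustrative value only
skewed = np.exp(fitness / temperature)

n_eff_uniform = effective_sample_size(uniform)
n_eff_skewed = effective_sample_size(skewed)

print(f"unweighted N_eff = {n_eff_uniform}")
print(f"weighted   N_eff = {n_eff_skewed:.2f}")
```

Under this assumed definition, any non-uniform weighting gives $N_{\text{eff}} < K$, which is why a method that prioritizes high-fitness samples without weights can retain the full sample size.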
Abstract
Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been recognized in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets, as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.
