Q-BIOLAT: Binary Latent Protein Fitness Landscapes for QUBO-Based Optimization

Truong-Son Hy

Abstract

Protein fitness optimization is inherently a discrete combinatorial problem, yet most learning-based approaches rely on continuous representations and are primarily evaluated through predictive accuracy. We introduce Q-BIOLAT, a framework for modeling and optimizing protein fitness landscapes in compact binary latent spaces. Starting from pretrained protein language model embeddings, we construct binary latent representations and learn a quadratic unconstrained binary optimization (QUBO) surrogate that captures unary and pairwise interactions. Beyond its formulation, Q-BIOLAT provides a representation-centric perspective on protein fitness modeling. We show that representations with similar predictive performance can induce fundamentally different optimization landscapes. In particular, learned autoencoder-based representations collapse after binarization, producing degenerate latent spaces that fail to support combinatorial search, whereas simple structured representations such as PCA yield high-entropy, decodable, and optimization-friendly latent spaces. Across multiple datasets and data regimes, we demonstrate that classical combinatorial optimization methods, including simulated annealing, genetic algorithms, and greedy hill climbing, are highly effective in structured binary latent spaces. By expressing the objective in QUBO form, our approach connects modern machine learning with discrete and quantum-inspired optimization. Our implementation and dataset are publicly available at: https://github.com/HySonLab/Q-BIOLAT-Extended
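The pipeline described in the abstract — project pretrained embeddings into a compact binary latent space, fit a QUBO surrogate over unary and pairwise bit interactions, then search with simulated annealing — can be sketched as follows. This is a minimal illustrative sketch, not the released implementation: the synthetic embeddings and fitness labels stand in for ESM features, and names such as `qubo_energy` and `feats` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for ESM embeddings and fitness labels (synthetic, for illustration).
X = rng.normal(size=(200, 32))                       # 200 sequences, 32-dim embeddings
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

# 1) PCA projection to n_bits components, then sign binarization.
n_bits = 8
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = (Xc @ Vt[:n_bits].T > 0).astype(float)           # binary latent codes in {0, 1}

# 2) Fit a QUBO surrogate f(z) = q^T z + z^T Q z by ridge regression
#    on unary and pairwise binary features.
iu = np.triu_indices(n_bits, k=1)

def feats(Zb):
    pair = Zb[:, iu[0]] * Zb[:, iu[1]]               # pairwise products z_i * z_j
    return np.hstack([Zb, pair])

F = feats(Z)
lam = 1e-2
w = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)

def qubo_energy(z):
    return float((feats(z[None, :]) @ w)[0])

# 3) Simulated annealing over single-bit flips to maximize the surrogate.
z = Z[rng.integers(len(Z))].copy()
best_z, best_e = z.copy(), qubo_energy(z)
T = 1.0
for step in range(500):
    i = rng.integers(n_bits)
    z_new = z.copy()
    z_new[i] = 1 - z_new[i]
    e_old, e_new = qubo_energy(z), qubo_energy(z_new)
    if e_new > e_old or rng.random() < np.exp((e_new - e_old) / T):
        z = z_new
    if qubo_energy(z) > best_e:
        best_e, best_z = qubo_energy(z), z.copy()
    T *= 0.99                                        # geometric cooling schedule
```

In the paper's setting, `best_z` would then be decoded back to a protein sequence and scored by an external oracle; the QUBO form of the surrogate is what makes the same objective directly submittable to quantum annealing hardware.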

Paper Structure

This paper contains 84 sections, 56 equations, 6 figures, and 17 tables.

Figures (6)

  • Figure 1: Overview of the Q-BioLat framework. Protein sequences are first encoded using a pretrained protein language model (ESM) to obtain continuous embeddings. These embeddings are transformed into binary latent representations through projection and binarization, enabling protein fitness to be modeled as a quadratic unconstrained binary optimization (QUBO) problem. The resulting latent fitness landscape can be explored using combinatorial optimization methods and is directly compatible with quantum annealing hardware. Optimized latent codes are mapped back to high-fitness protein sequences.
  • Figure 2: Test-set performance of external sequence-level fitness oracles across data regimes on GFP and AAV. Each panel shows the Spearman correlation between predicted and ground-truth fitness as a function of the number of training samples. Across both datasets, Gaussian process regression often achieves the strongest performance in both low-data and moderate-data regimes, reflecting its ability to model uncertainty under limited supervision. As the dataset size increases, ridge regression becomes increasingly competitive and often achieves near the best performance, indicating that the fitness signal is largely captured by a linear model in the ESM embedding space. In contrast, XGBoost underperforms in the smallest-data regime, suggesting overfitting, but improves steadily with additional data. These results establish a reliable sequence-level oracle for evaluating generated protein sequences and highlight the importance of selecting appropriate models under different data regimes.
  • Figure 3: Scaling of decoding performance with dataset size for PCA-based and random-projection binary latent representations at 64 bits. The x-axis shows the number of training samples and the y-axis shows mutation F1. Across both GFP and AAV datasets, PCA consistently achieves higher decoding accuracy than random projection, and performance improves with increasing data.
  • Figure 4: End-to-end sequence design performance after optimization, decoding, and oracle scoring. For each dataset and training size, we plot the best-performing PCA-based and random-projection configuration. Each point is annotated with the corresponding latent dimension and optimizer. Across most settings, PCA-based binary latent representations achieve the strongest sequence-level performance, particularly at moderate and larger data regimes.
  • Figure 5: End-to-end design performance on GFP across different training sizes. The x-axis shows the number of bits and the y-axis shows the best oracle score. Each line corresponds to an optimizer-representation pair, including simulated annealing (SA), genetic algorithm (GA), random search (RS), and greedy hill climbing (GHC), combined with either PCA-based or random-projection binary latent representations.
  • ...and 1 more figure