Binary Latent Protein Fitness Landscapes for Quantum Annealing Optimization

Truong-Son Hy

Abstract

We propose Q-BIOLAT, a framework for modeling and optimizing protein fitness landscapes in binary latent spaces. Starting from protein sequences, we leverage pretrained protein language models to obtain continuous embeddings, which are then transformed into compact binary latent representations. In this space, protein fitness is approximated using a quadratic unconstrained binary optimization (QUBO) model, enabling efficient combinatorial search via classical heuristics such as simulated annealing and genetic algorithms. On the ProteinGym benchmark, we demonstrate that Q-BIOLAT captures meaningful structure in protein fitness landscapes and enables the identification of high-fitness variants. Despite using a simple binarization scheme, our method consistently retrieves sequences whose nearest neighbors lie within the top fraction of the training fitness distribution, particularly under the strongest configurations. We further show that different optimization strategies exhibit distinct behaviors, with evolutionary search performing better in higher-dimensional latent spaces and local search remaining competitive in preserving realistic sequences. Beyond its empirical performance, Q-BIOLAT provides a natural bridge between protein representation learning and combinatorial optimization. By formulating protein fitness as a QUBO problem, our framework is directly compatible with emerging quantum annealing hardware, opening new directions for quantum-assisted protein engineering. Our implementation is publicly available at: https://github.com/HySonLab/Q-BIOLAT
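As a rough illustration of the optimization step described in the abstract, the sketch below scores a binary latent code z with a generic QUBO surrogate, f(z) = zᵀQz (the diagonal of Q carries the linear terms because z_i² = z_i), and searches for a high-scoring code with single-bit-flip simulated annealing. The function names, the geometric cooling schedule, and the random matrix Q are assumptions made for this example; they are not taken from the Q-BIOLAT implementation, whose exact parametrization and annealing settings may differ.

```python
import numpy as np

def qubo_score(Q, z):
    """Surrogate fitness z^T Q z for a binary vector z (hypothetical helper)."""
    return float(z @ Q @ z)

def simulated_annealing(Q, n_steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Maximize z^T Q z over z in {0,1}^m using single-bit-flip proposals."""
    rng = np.random.default_rng(seed)
    m = Q.shape[0]
    z = rng.integers(0, 2, size=m)           # random initial binary code
    score = qubo_score(Q, z)
    best_z, best_score = z.copy(), score
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        i = rng.integers(m)                   # pick one bit to flip
        cand = z.copy()
        cand[i] = 1 - cand[i]
        cand_score = qubo_score(Q, cand)
        delta = cand_score - score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < np.exp(delta / t):
            z, score = cand, cand_score
            if score > best_score:
                best_z, best_score = z.copy(), score
    return best_z, best_score

# Toy usage: a random symmetric Q stands in for a fitted surrogate.
rng = np.random.default_rng(1)
A = rng.normal(size=(16, 16))
Q = (A + A.T) / 2
best_z, best_score = simulated_annealing(Q)
print(best_z, best_score)
```

Because the objective is already in QUBO form, the same matrix Q could in principle be handed to a quantum annealer or any other QUBO solver in place of the classical loop above.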

Paper Structure

This paper contains 33 sections, 12 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Overview of the Q-BIOLAT framework. Protein sequences are first encoded using a pretrained protein language model (ESM) to obtain continuous embeddings. These embeddings are transformed into binary latent representations through projection and binarization, enabling protein fitness to be modeled as a quadratic unconstrained binary optimization (QUBO) problem. The resulting latent fitness landscape can be explored using combinatorial optimization methods and is directly compatible with quantum annealing hardware. Optimized latent codes are mapped back to high-fitness protein sequences.
  • Figure 2: Effect of latent dimension on optimization and surrogate performance on the GFP benchmark. Each curve corresponds to a different dataset size. Moderate latent dimensions (e.g., 16–32) provide a favorable trade-off between predictive accuracy and optimization stability.
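
The projection, binarization, and decoding steps named in the Figure 1 caption can be sketched as follows. This is a minimal example under stated assumptions: a random Gaussian projection and a per-dimension median threshold stand in for whatever projection and binarization the paper actually uses, decoding is approximated by a Hamming nearest-neighbor lookup among training codes, and the random matrix X is a placeholder for real ESM embeddings. All function names are hypothetical.

```python
import numpy as np

def binarize_embeddings(X, n_bits, seed=0):
    """Project embeddings X of shape (n, d) to n_bits dimensions and threshold at the median.

    Assumed scheme: random Gaussian projection followed by a per-dimension
    median threshold, which gives roughly balanced bits.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_bits)) / np.sqrt(X.shape[1])
    P = X @ W                                 # continuous low-dimensional projection
    thresholds = np.median(P, axis=0)         # one threshold per latent bit
    return (P > thresholds).astype(np.uint8)

def nearest_by_hamming(code, train_codes):
    """Return (index, distance) of the training code closest to `code` in Hamming distance."""
    dists = np.count_nonzero(train_codes != code, axis=1)
    idx = int(np.argmin(dists))
    return idx, int(dists[idx])

# Toy usage: random vectors stand in for protein language model embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 320))               # 100 "sequences", 320-dim placeholder embeddings
codes = binarize_embeddings(X, n_bits=16)     # 16 bits, within the 16-32 range noted in Figure 2
idx, dist = nearest_by_hamming(codes[3], codes)
print(codes.shape, idx, dist)
```

In a full pipeline, the index returned by the lookup would identify the training sequence (and its measured fitness) associated with an optimized latent code.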