Large Language Model is Secretly a Protein Sequence Optimizer

Yinkai Wang; Jiaxing He; Yuanqi Du; Xiaohui Chen; Jianan Canal Li; Li-Ping Liu; Xiaolin Xu; Soha Hassoun

TL;DR

This work shows that pre-trained large language models can act as on-the-fly protein sequence optimizers within a directed-evolution framework, enabling Pareto- and budget-constrained search for high-fitness variants without additional fine-tuning. The LLM is sampled to propose variants, taking over the mutation and crossover steps, and the approach outperforms traditional evolutionary algorithms on complex landscapes and under practical experimental constraints. The study validates the approach on exact, synthetic, and ML-based fitness landscapes, and on both single- and multi-objective tasks. The results suggest a promising path toward integrating LLM-driven optimization into real-world directed evolution workflows to accelerate protein design.
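The directed-evolution loop summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: `random_proposer` is a hypothetical stand-in for the LLM call (it performs a one-point crossover followed by a single random substitution), and `directed_evolution`, `pool_size`, and `offspring` are names chosen here for illustration. In an LLM-guided variant, the proposer would instead prompt a model with the current pool and parse a new sequence from its output.

```python
import random
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_proposer(parents: List[str], rng: random.Random) -> str:
    """Hypothetical stand-in for the LLM: crossover of two parents plus one mutation."""
    if len(parents) >= 2:
        a, b = rng.sample(parents, 2)
    else:
        a = b = parents[0]
    cut = rng.randrange(1, len(a))          # one-point crossover position
    child = list(a[:cut] + b[cut:])
    pos = rng.randrange(len(child))         # single-site substitution
    child[pos] = rng.choice(AMINO_ACIDS)
    return "".join(child)


def directed_evolution(
    wild_type: str,
    fitness: Callable[[str], float],
    proposer: Callable[[List[str], random.Random], str] = random_proposer,
    iterations: int = 10,
    pool_size: int = 8,
    offspring: int = 16,
    seed: int = 0,
) -> Tuple[str, float]:
    """Iteratively propose variants, score them, and keep the best as parents."""
    rng = random.Random(seed)
    pool = [wild_type]
    scores: Dict[str, float] = {wild_type: fitness(wild_type)}
    for _ in range(iterations):
        for _ in range(offspring):
            child = proposer(pool, rng)
            if child not in scores:         # score each distinct variant once
                scores[child] = fitness(child)
        # Selection: the top-scoring sequences seen so far become the next parents.
        pool = sorted(scores, key=scores.get, reverse=True)[:pool_size]
    best = pool[0]
    return best, scores[best]


# Toy usage: fitness is the number of positions matching a made-up target.
target = "MKTAYIAK"
toy_fitness = lambda s: sum(x == y for x, y in zip(s, target))
best, score = directed_evolution("MATAYAAK", toy_fitness, iterations=5)
```

Because selection is taken over every sequence scored so far, the returned score can never fall below the wild-type score. The toy fitness function and target sequence are invented for the example and carry no biological meaning.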

Abstract

We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence. Directed evolution has been a dominant paradigm in this field: it iteratively generates variants and selects among them using experimental feedback. We demonstrate that large language models (LLMs), despite being trained on massive text corpora, are secretly protein sequence optimizers. With a directed evolutionary method, LLMs can perform protein engineering through Pareto and experiment-budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes.
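To make the Pareto-based selection concrete, the sketch below filters a set of candidates down to its non-dominated subset and contrasts that with ranking by the sum of objectives. The objective pairs are invented for illustration and assume, as one plausible reading, that each candidate is scored by fitness and by the negated number of mutations from the wild type, so that larger is better for both. The function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_k_by_sum(points: List[Point], k: int) -> List[Point]:
    """Alternative selection rule: rank candidates by the plain sum of objectives."""
    return sorted(points, key=sum, reverse=True)[:k]


# Made-up (fitness, -mutation_count) pairs for five candidate sequences.
candidates = [(0.9, -3), (0.7, -1), (0.6, -2), (0.9, -4), (0.2, 0)]

front = pareto_front(candidates)       # [(0.9, -3), (0.7, -1), (0.2, 0)]
by_sum = top_k_by_sum(candidates, 2)   # [(0.2, 0), (0.7, -1)]
```

In this toy case the sum-based rule drops the high-fitness candidate `(0.9, -3)` because the two objectives are on different scales, whereas the Pareto filter keeps every trade-off that is not strictly worse than another.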

Paper Structure (14 sections, 5 equations, 7 figures, 4 tables, 1 algorithm)

Introduction
Preliminary: Protein Sequence Optimization
Methodology
Experiment
Experiment Set-up
Main Experiment
Conclusion
Appendix
Pseudocode
Datasets analyze
Prompts
Pareto Frontier
Ablation Study of the Number of Iterations
Additional experiments result

Figures (7)

Figure 1: The overview of the optimization framework.
Figure 2: Pareto frontiers identified under constrained and budget-constrained optimization settings.
Figure 3: Pareto frontiers identified under multi-objective optimizations. We display the Pareto frontiers found for TrpB (a) and Syn-3bfo (b), using Pareto set selection (left) and sum of objectives (right), respectively. We also show the groundtruth Pareto frontiers for TrpB.
Figure 4: The fitness heatmaps of first two, last two, last three, and full sequence on two datasets.
Figure 5: The Pareto frontiers identified by EA and our method in both constrained and budget-constrained optimization settings for all parameter configurations.
...and 2 more figures
