Large Language Model is Secretly a Protein Sequence Optimizer
Yinkai Wang, Jiaxing He, Yuanqi Du, Xiaohui Chen, Jianan Canal Li, Li-Ping Liu, Xiaolin Xu, Soha Hassoun
TL;DR
This work shows that pre-trained large language models can act as on-the-fly protein sequence optimizers within a directed-evolution framework, enabling Pareto- and budget-constrained search for high-fitness variants without additional fine-tuning. By sampling from the LLM to guide mutation and crossover, the approach outperforms traditional evolutionary algorithms on complex landscapes and under practical experimental constraints. The study validates the approach on exact, synthetic, and ML-based fitness landscapes, and on both single- and multi-objective tasks. The results suggest a promising path toward integrating LLM-driven optimization into real-world directed evolution workflows to accelerate protein design.
Abstract
We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence. Directed evolution has been a dominant paradigm in this field: it iteratively generates variants and selects among them using experimental feedback. We demonstrate that large language models (LLMs), despite being trained on massive text corpora, are secretly protein sequence optimizers. Within a directed evolutionary method, an LLM can perform protein engineering through Pareto and experiment-budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes.
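To make the loop described in the abstract concrete, the following is a minimal sketch of directed evolution under a fixed evaluation budget, together with a Pareto-front filter for the multi-objective case. It is not the authors' implementation. Everything named here is an assumption for illustration: `propose_variants` is a random crossover-and-mutation stand-in for the step where an LLM would be sampled, `toy_fitness` is a hypothetical oracle that scores similarity to a hidden target, and the sequences and parameter values are made up.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_fitness(seq, target):
    """Hypothetical oracle: fraction of positions matching a hidden target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def propose_variants(parents, n_children, rng):
    """Stand-in for the LLM proposal step (assumption): one-point crossover
    between two randomly chosen parents, followed by one point mutation."""
    children = []
    for _ in range(n_children):
        p1, p2 = rng.choice(parents), rng.choice(parents)
        cut = rng.randrange(1, len(p1))  # requires sequences of length >= 2
        child = list(p1[:cut] + p2[cut:])
        child[rng.randrange(len(child))] = rng.choice(AMINO_ACIDS)
        children.append("".join(child))
    return children


def directed_evolution(wild_type, fitness_fn, budget,
                       pop_size=8, n_children=16, max_rounds=1000, seed=0):
    """Iterate propose -> evaluate -> select until the budget is spent."""
    rng = random.Random(seed)
    scored = {wild_type: fitness_fn(wild_type)}  # every evaluated sequence
    evaluations = 1
    for _ in range(max_rounds):
        if evaluations >= budget:
            break
        # Selection: keep the top-scoring sequences seen so far as parents.
        parents = sorted(scored, key=scored.get, reverse=True)[:pop_size]
        for child in propose_variants(parents, n_children, rng):
            if evaluations >= budget:
                break
            if child not in scored:  # spend budget only on new sequences
                scored[child] = fitness_fn(child)
                evaluations += 1
    best = max(scored, key=scored.get)
    return best, scored[best], evaluations


def pareto_front(points):
    """Return the points not dominated by any other (all objectives maximized)."""
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Usage with made-up sequences: the target is hidden from the optimizer.
target = "MKTAYIAKQRQI"
wild_type = "MKTGGGAKQAAA"
best, best_fit, used = directed_evolution(
    wild_type, lambda s: toy_fitness(s, target), budget=200)
```

In an LLM-guided variant, only `propose_variants` would change: the parents and their scores would be formatted into a prompt and the children parsed from the model's output. The budget check sits inside the evaluation loop because each fitness call stands in for a costly experiment, so the search never exceeds the allotted number of measurements.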
