Boosting In-Silicon Directed Evolution with Fine-Tuned Protein Language Model and Tree Search

Yaodong Yang, Yang Wang, Jinpeng Li, Pei Guo, Da Han, Guangyong Chen, Pheng-Ann Heng

TL;DR

AlphaDE addresses the gap between protein language models (PLMs) and advanced optimization in in silico directed evolution. It fine-tunes PLMs on homologous sequences and then runs Monte Carlo tree search (MCTS) guided by the learned evolutionary priors. The framework achieves state-of-the-art results across eight benchmark tasks, demonstrates strong few-shot and zero-shot capabilities, and can condense protein sequence space in a proof-of-concept avGFP study. By combining a scalable prior with MCTS, AlphaDE shows robust performance, diverse sequence generation, and potential for practical protein engineering. The authors acknowledge oracle biases and data requirements as limitations. Overall, the work presents a principled approach to learnable, policy-driven directed evolution that builds on modern PLMs and search techniques.

Abstract

Protein evolution through amino acid mutations is a cornerstone of life sciences. Recent advances in protein language models have revealed rich evolutionary patterns, offering unprecedented potential for in silico directed evolution. However, existing directed evolution methods largely rely on heuristic evolution strategies and have yet to efficiently integrate protein language models with advanced optimization techniques, such as reinforcement learning, to learn optimal evolution policies. To bridge this gap, we propose AlphaDE, a novel framework that evolves protein sequences by adopting two paradigms from large language models: fine-tuning and test-time inference. First, AlphaDE fine-tunes pretrained protein language models using masked language modeling on homologous protein sequences to capture the evolutionary plausibility of the protein family of interest. Second, AlphaDE introduces test-time inference based on Monte Carlo tree search, which evolves proteins with evolutionary guidance from the fine-tuned protein language model. Extensive benchmark experiments show that AlphaDE substantially outperforms previous state-of-the-art methods, even with few-shot fine-tuning. A case study further demonstrates that AlphaDE supports condensing the protein sequence space of avGFP through computational evolution.
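
The second step in the abstract, tree search guided by a learned prior, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the PUCT-style selection rule, the single-site substitution move set, the per-position frequency profile standing in for a fine-tuned protein language model (`profile_prior`), the made-up homologs, and the match-counting fitness oracle. The masked-language-modeling fine-tuning step is not shown.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def profile_prior(homologs, pseudocount=1.0):
    """Hypothetical stand-in for a fine-tuned PLM: a per-position frequency
    profile built from equal-length homologs, turned into a prior over moves."""
    profile = []
    for i in range(len(homologs[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for h in homologs:
            counts[h[i]] += 1.0
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})

    def prior_fn(seq):
        # A move is (position, new residue); weights are normalized to sum to 1.
        weights = {(i, aa): profile[i][aa]
                   for i in range(len(seq)) for aa in ALPHABET if aa != seq[i]}
        z = sum(weights.values())
        return {move: w / z for move, w in weights.items()}

    return prior_fn

class Node:
    def __init__(self, seq, prior=1.0):
        self.seq, self.prior = seq, prior
        self.visits, self.value_sum = 0, 0.0
        self.children = {}  # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct(parent, child, c_puct):
    # Unvisited children borrow the parent's mean value (an assumed heuristic).
    q = child.q() if child.visits else parent.q()
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return q + u

def search(start, fitness, prior_fn, n_sim=200, max_depth=3, c_puct=1.0):
    """Run n_sim simulations; return the fittest sequence evaluated and its score."""
    root = Node(start)
    best_seq, best_fit = start, fitness(start)
    for _ in range(n_sim):
        node, path = root, [root]
        # Selection: follow the highest-scoring child until reaching a leaf.
        while node.children and len(path) <= max_depth:
            parent = node
            node = max(parent.children.values(),
                       key=lambda ch: puct(parent, ch, c_puct))
            path.append(node)
        # Expansion: add every single-site substitution, weighted by the prior.
        if not node.children and len(path) <= max_depth:
            for (i, aa), p in prior_fn(node.seq).items():
                node.children[(i, aa)] = Node(node.seq[:i] + aa + node.seq[i + 1:], p)
        # Evaluation: score the reached sequence with the fitness oracle.
        value = fitness(node.seq)
        if value > best_fit:
            best_seq, best_fit = node.seq, value
        # Backup: update visit counts and value sums along the path.
        for n in path:
            n.visits += 1
            n.value_sum += value
    return best_seq, best_fit

# Toy usage with made-up data: three "homologs" and a match-counting oracle.
homologs = ["ACDAAA", "ACDAAA", "ACEAAA"]
target = "ACDAAA"
fitness = lambda s: sum(a == b for a, b in zip(s, target)) / len(target)
best_seq, best_fit = search("AAAAAA", fitness, profile_prior(homologs))
print(best_seq, best_fit)
```

In this sketch the prior only biases which mutations are explored first, while the oracle score decides which branches are revisited. In a realistic setting the profile would be replaced by probabilities from a fine-tuned masked language model and the oracle by a learned or experimental fitness predictor.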

Paper Structure

This paper contains 34 sections, 2 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: The framework of AlphaDE. It consists of a fine-tuning step and an MCTS inference step.
  • Figure 2: AlphaDE with ESM2-35M fine-tuned on different numbers of sequences randomly sampled from the whole data distribution. Shaded regions show 95% confidence intervals.
  • Figure 3: AlphaDE with ESM2-35M fine-tuned on different numbers of sequences randomly sampled from the bottom 20% of the data. "Max Bottom 20%" denotes the maximum fitness value in the bottom 20% of the data. Shaded regions show 95% confidence intervals.
  • Figure 4: AlphaDE's performance scales with the size of the pretrained protein language model. The black horizontal dashed line indicates AlphaDE with fine-tuned ESM2-35M. Shaded regions show 95% confidence intervals.
  • Figure 5: Illustration of how AlphaDE condenses the sequence space of avGFP. The evolution trajectory is sampled from one trial, and the first structure is the AlphaFold 3 prediction for the starting sequence.
  • ...and 1 more figures