
TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

Yiqing Shen, Zan Chen, Michail Mamalakis, Yungeng Liu, Tianbin Li, Yanzhou Su, Junjun He, Pietro Liò, Yu Guang Wang

TL;DR

TourSynbio introduces two components: TourSynbio-7B, a multi-modal large model that learns protein sequences as language without external encoders, and TourSynbio-Agent, an agent framework that unifies diverse protein-engineering tools under a conversational interface. Built on InternLM2-7B and trained with ProteinLMDataset (17.46B tokens for self-supervised pretraining and 893K instructions for supervised fine-tuning), TourSynbio-7B achieves state-of-the-art accuracy on ProteinLMBench (62.18%) and outperforms GPT-4-turbo on this benchmark. The agent workflow combines intent detection, keyword routing with fuzzy matching, user-guided selection, parameter extraction, and end-to-end execution, paired with a human-centered UI that supports model selection, agent selection, and file uploads. The authors validate the approach with two wet-lab case studies (vanilla key enzyme modification and P450 steroid catalysis), reporting improved mutation accuracy, shorter delivery time, and greater automation. They also note that complex structure prediction could be improved by incorporating advanced structural models in future work.

Abstract

The structural similarities between protein sequences and natural languages have led to parallel advancements in deep learning across both domains. While large language models (LLMs) have achieved much progress in the domain of natural language processing, their potential in protein engineering remains largely unexplored. Previous approaches have equipped LLMs with protein understanding capabilities by incorporating external protein encoders, but this fails to fully leverage the inherent similarities between protein sequences and natural languages, resulting in sub-optimal performance and increased model complexity. To address this gap, we present TourSynbio-7B, the first multi-modal large model specifically designed for protein engineering tasks without external protein encoders. TourSynbio-7B demonstrates that LLMs can inherently learn to understand proteins as language. The model is post-trained and instruction fine-tuned on InternLM2-7B using ProteinLMDataset, a dataset comprising 17.46 billion tokens of text and protein sequence for self-supervised pretraining and 893K instructions for supervised fine-tuning. TourSynbio-7B outperforms GPT-4 on the ProteinLMBench, a benchmark of 944 manually verified multiple-choice questions, with 62.18% accuracy. Leveraging TourSynbio-7B's enhanced protein sequence understanding capability, we introduce TourSynbio-Agent, an innovative framework capable of performing various protein engineering tasks, including mutation analysis, inverse folding, protein folding, and visualization. TourSynbio-Agent integrates previously disconnected deep learning models in the protein engineering domain, offering a unified conversational user interface for improved usability. Finally, we demonstrate the efficacy of TourSynbio-7B and TourSynbio-Agent through two wet lab case studies on vanilla key enzyme modification and steroid compound catalysis.

Paper Structure (38 sections, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Illustration of the traditional LLM and TourSynbio-7B for protein sequence understanding. (a) Traditional methods use external encoders to process protein sequences before feeding them into LLMs. (b) The proposed TourSynbio-7B method directly processes protein sequences using an LLM without the need for external encoders, thus simplifying the workflow and improving efficiency.
  • Figure 2: The overall workflow of TourSynbio-Agent from initial user input to final task execution. It encompasses several stages: (1) Intent classification using TourSynbio-7B to determine if an agent call is required; (2) Keyword-based agent selection with fuzzy matching to identify the appropriate agent; (3) User-guided selection for functionally similar agents when necessary; (4) Parameter extraction and interactive validation to ensure accurate inputs; and (5) Agent execution. The process incorporates multiple decision points and feedback loops, allowing for efficient handling of various scenarios, including false positives and ambiguous user inputs. This design ensures robust and accurate execution of protein engineering tasks while maintaining a user-friendly interface.
  • Figure 3: The human-centered conversational user interface of TourSynbio-Agent. It integrates four key components: (1) Model Selection, allowing users to choose between TourSynbio-7B and other language models; (2) Agent Selection, enabling customization of active protein engineering tools; (3) File Upload, supporting various file formats for data input; and (4) Text Input area for natural language interactions. The interface also showcases real-time visualizations and results from various protein engineering tasks, including the PyMOL, ESMFold, and CaLM agents. This intuitive design facilitates seamless interaction between users and the AI-driven protein engineering models, enhancing workflow efficiency and accessibility for both experts and non-experts in the field. A video demo is available at https://github.com/tsynbio/TourSynbio/blob/main/demo/video_demo.mp4.
  • Figure 4: Comparison between the traditional programming-based approach and TourSynbio-Agent for protein engineering tasks. The left panel shows a code snippet for mutation prediction using a deep learning model, while the right panel shows TourSynbio-Agent's conversational interface performing the same task with the ESM-1v model. The agent replaces hand-written code with natural-language interaction, thereby streamlining the protein engineering workflow.
  • Figure 5: Results of the vanilla key enzyme modification case study. (a) Summary of TourSynbio-Agent's performance improvements in enzyme engineering: dry-lab delivery time reduced to less than one day, enzyme activity increased more than fourfold, and production costs reduced fourfold. (b) Table of mutants and their concentrations, with E10W highlighted in red because of its significantly higher concentration. (c) Table of protein mutants ranked by selectivity, with the top performer highlighted in red.
  • ...and 3 more figures
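
The five stages listed in the Figure 2 caption can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the keyword table, the trigger-word stand-in for intent classification (the paper uses TourSynbio-7B for that stage), the `choose` callback, the sequence regex, and the returned dictionaries are all hypothetical. The agent names are taken from the figure captions, and fuzzy matching uses `difflib` from the Python standard library.

```python
import re
from difflib import get_close_matches

# Hypothetical keyword -> agent table (agent names come from the captions;
# the keywords themselves are illustrative assumptions).
AGENT_KEYWORDS = {
    "mutation": "ESM-1v",
    "folding": "ESMFold",
    "visualization": "PyMOL",
    "codon": "CaLM",
}

# Hypothetical parameter pattern: a run of 10+ amino-acid letters.
SEQ_PATTERN = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]{10,}\b")


def classify_intent(message):
    """Stage 1 stand-in: trigger words instead of an LLM classifier."""
    triggers = ("predict", "fold", "visuali", "mutat", "run")
    return any(t in message.lower() for t in triggers)


def select_agents(message, cutoff=0.8):
    """Stage 2: fuzzy-match each word against the known keywords."""
    hits = []
    for word in message.lower().split():
        word = word.strip(".,;:!?")
        for kw in get_close_matches(word, list(AGENT_KEYWORDS), n=1, cutoff=cutoff):
            agent = AGENT_KEYWORDS[kw]
            if agent not in hits:
                hits.append(agent)
    return hits


def handle(message, choose=lambda options: options[0]):
    """Run stages 1-5 and describe the resulting action as a dict."""
    if not classify_intent(message):
        return {"action": "chat"}
    candidates = select_agents(message)
    if not candidates:
        # Stage 1 fired but no agent matched: treat as a false positive.
        return {"action": "chat"}
    # Stage 3: let the user pick when several agents match.
    agent = candidates[0] if len(candidates) == 1 else choose(candidates)
    # Stage 4: extract the parameter, or ask the user for it.
    match = SEQ_PATTERN.search(message)
    if match is None:
        return {"action": "ask_user", "agent": agent, "missing": "sequence"}
    # Stage 5: a real system would call the agent here.
    return {"action": "run", "agent": agent, "sequence": match.group()}
```

In this sketch a misspelled request such as "Run a mutaton scan" is still routed to the ESM-1v entry, and because no sequence is present the function returns an `ask_user` action, mirroring the interactive validation step in the caption. User-guided selection is triggered here whenever more than one agent matches, which simplifies the caption's "functionally similar agents" condition.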