Computational Protein Science in the Era of Large Language Models (LLMs)

Wenqi Fan; Yi Zhou; Shijie Wang; Yuyao Yan; Hui Liu; Qian Zhao; Le Song; Qing Li

TL;DR

This paper surveys computational protein science through the lens of large language models (LLMs), organizing protein language models (pLMs) into sequence-based, structure-/function-enhanced, and multimodal categories. It analyzes how pLMs contribute to structure prediction, function prediction, and design, including practical workflows for antibodies, enzymes, and drug discovery. Key contributions include systematic categorizations of pLM knowledge, strategies for utilization and adaptation, and a discussion of future challenges such as data scarcity, protein interactions, explainability, and computational efficiency. The work highlights the potential of pLMs to accelerate discovery by enabling end-to-end reasoning across sequence, structure, and function, while noting the need for bridging computational predictions with experimental validation and robust, scalable deployment.

Abstract

Given the significance of proteins, computational protein science has long been a critical scientific field, dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm. Over the last few decades, Artificial Intelligence (AI) has had a significant impact on computational protein science, leading to notable successes in specific protein modeling tasks. However, these earlier AI models still face limitations, such as difficulty comprehending the semantics of protein sequences and an inability to generalize across a wide range of protein modeling tasks. Recently, LLMs have emerged as a milestone in AI due to their unprecedented language processing and generalization capabilities. They can drive progress across entire fields rather than solving individual tasks. As a result, researchers have actively introduced LLM techniques into computational protein science, developing protein Language Models (pLMs) that grasp the foundational knowledge of proteins and generalize effectively to a diverse range of sequence-structure-function reasoning problems. Given these rapid developments, a systematic overview of computational protein science empowered by LLM techniques is needed. First, we summarize existing pLMs into categories based on the protein knowledge they have mastered, i.e., underlying sequence patterns, explicit structural and functional information, and external scientific languages. Second, we introduce the utilization and adaptation of pLMs, highlighting their achievements in advancing protein structure prediction, protein function prediction, and protein design. Third, we describe the practical application of pLMs in antibody design, enzyme design, and drug discovery. Finally, we discuss promising future directions in this fast-growing field.

Paper Structure (35 sections, 13 figures, 4 tables)

Introduction
Background
Biological Basis and Data Profiles
AI for Protein Science
Large Language Models (LLMs)
Pre-trained Protein Language Models
Sequence-based pLMs
Single-Sequence-based pLMs
Multiple-Sequences-based pLMs
Structure and Function Enhanced pLMs
Structure Enhanced pLMs
Function Enhanced pLMs
Multimodal pLMs
Utilization and Adaptation of Protein Language Models
Protein Structure Prediction
...and 20 more sections

Figures (13)

Figure 1: Illustration of the Evolution and Sequence-Structure-Function Relationships. (A) The arrangement of amino acids forms a vast space of possible protein sequences. However, only a small fraction of these sequences survive millions of years of evolution. (B) Valid amino acid sequences fold into stable 3D structures and carry out proper functions. (C) Information flow within the sequence-structure-function paradigm can be leveraged in reverse, leading to the optimization of existing proteins or de novo protein design oriented by desired functions.
Figure 2: Biological basis and data profiles. (A) Protein synthesis mainly involves the transcription of protein-coding genes to mRNAs and the translation of codon sequences to amino acid (AA) sequences. (B) A Multiple Sequence Alignment (MSA) contains the evolutionary prior knowledge of proteins. Conserved positions are interpreted as core AAs for protein structure, since no changes have been tolerated at them throughout evolution. Pairs of co-evolving positions indicate spatial contacts between AAs, since mutations at such positions occur together and act synergistically to keep the structure stable and unchanged. (C) Protein structure exhibits hierarchical organization. (D) Protein structure can be described in several forms. 3D coordinates of atoms faithfully record the experimentally determined protein conformation. A 2D distance map conveys the proximity between all possible AA pairs. Furthermore, specific graphs can be built to describe detailed structural characteristics, where the interatomic or inter-residue distances, angles, and directions are encoded as node and edge features. (E) An antibody is a Y-shaped protein composed of two heavy and two light chains. At the tips of the arms of the "Y", complementarity-determining regions (CDRs) are polypeptide segments that make up the antigen-binding site. (F) Protein function is described in multiple formats, such as lab-generated labels, Gene Ontology annotations, and textual documents.
Figure 3: Typical single-sequence-based pLMs. 1-3) When individual amino acid sequences are treated as "sentences", pLMs follow the same general approaches as LLMs: autoencoding, autoregressive, and sequence-to-sequence. 4) Masked CDR reconstruction is a novel pre-training objective that incorporates the inherent characteristics of antibodies into masked language modeling.
Figure 4: Typical multiple-sequences-based pLMs. 1) ESM-MSA-1b, a representative MSA-based pLM, incorporates bidirectional tied-row attention and column attention within each MSA Transformer block, thereby capturing co-evolution features within the 2D input. 2) PoET is an autoregressive model specifically designed to learn the distribution over protein families. It accepts multiple sequences as input without the need for alignment and can generate sets of homologous proteins.
Figure 5: Typical structure-enhanced pLMs. 1) Pre-calculated structural features can be injected into the input AA sequence as position encoding, or utilized in an additional training objective. 2) Given the significant correlation between the Transformer attention map and protein structural contacts, structural graphs can be encoded by GNNs and combined with the attention module of pLMs. 3) Local structure states along the polypeptide chain are distilled into discrete tokens, which are subsequently used in the training procedure of pLMs. 4) ESM-3 represents the sequence, structure, and function of a protein as multiple tracks of discrete tokens, with all kinds of information fused within a unified latent space. In particular, the first Transformer block contains an additional geometric attention to process the protein backbone structure.
...and 8 more figures
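
The Figure 2 caption describes two data representations that are easy to make concrete: the 2D distance map derived from 3D coordinates (panel D) and conserved positions in an MSA (panel B). The following is a minimal sketch, not the paper's method. The function names, the toy coordinates and sequences, and the 8 angstrom contact cutoff (a common convention, assumed here) are all illustrative. Column entropy captures only conservation; detecting co-evolving pairs would need a pairwise statistic that is not shown.

```python
import math
from collections import Counter


def distance_map(coords):
    """Pairwise Euclidean distances between residue coordinates (e.g. one atom per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(coords, cutoff=8.0):
    """Binary map: 1 where two residues lie within `cutoff` (assumed threshold, in angstroms)."""
    return [[int(d <= cutoff) for d in row] for row in distance_map(coords)]


def column_entropy(msa):
    """Shannon entropy in bits for each alignment column; 0.0 means fully conserved."""
    entropies = []
    for column in zip(*msa):  # iterate over aligned positions
        total = len(column)
        probs = [n / total for n in Counter(column).values()]
        entropies.append(sum(-p * math.log2(p) for p in probs))
    return entropies


# Toy inputs, invented for illustration only (not real protein data).
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
toy_msa = ["MKV", "MRV", "MKI", "MRL"]

dmap = distance_map(toy_coords)
cmap = contact_map(toy_coords)
entropy = column_entropy(toy_msa)
```

In this toy alignment the first column is identical in every sequence, so its entropy is zero, while the other two columns vary. The distance map is symmetric with a zero diagonal, which is why structure-aware models can work with either half of it.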
