Protein Large Language Models: A Comprehensive Survey

Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei Wang

TL;DR

This survey examines how Protein LLMs model language-protein relationships to support structure prediction, function prediction, and protein design, systematically cataloging architectures, data strategies, and evaluation protocols across the literature. It offers a taxonomy that ranges from single-sequence LLMs and MSA-based models to structure-aware and knowledge-enhanced approaches, and it surveys the pretraining data and benchmarking resources used to assess performance. The work compiles datasets such as UniProt, Pfam, PDB, and AlphaFoldDB and benchmarks such as CASP, ProteinGym, TAPE, and PEER, and maps popular metrics for structure, function, and generation tasks. It also outlines key challenges (protein dynamics, single-cell proteomics integration, interpretability, and domain knowledge incorporation) alongside future directions for more capable, trustworthy Protein LLMs with real-world biomedical impact.

Abstract

Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art Protein LLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at https://github.com/Yijia-Xiao/Protein-LLM-Survey.

Paper Structure

This paper contains 22 sections, 3 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: An Overview of Tasks in Protein Large Language Models.
  • Figure 2: An Overview of Methods of Protein Large Language Models.
  • Figure 3: Taxonomy of Protein Large Language Models.
  • Figure 4: Illustrations on General Tasks of Protein Language Models.