Scientific Large Language Models: A Survey on Biological & Chemical Domains

Qiang Zhang; Keyang Ding; Tianwen Lyv; Xinda Wang; Qingyu Yin; Yiwen Zhang; Jing Yu; Yuhao Wang; Xiaotong Li; Zhuoyi Xiang; Kehua Feng; Xiang Zhuang; Zeyuan Wang; Ming Qin; Mengyao Zhang; Jinlu Zhang; Jiyu Cui; Tao Huang; Pengju Yan; Renjun Xu; Hongyang Chen; Xiaolin Li; Xiaohui Fan; Huabin Xing; Huajun Chen

TL;DR

This survey consolidates the rapidly evolving field of scientific large language models (Sci-LLMs), focusing on the biological and chemical domains. It systematizes the concept of scientific languages, a taxonomy of model architectures (encoder-only, decoder-only, encoder-decoder), training pipelines, and domain-specific data sources across textual, molecular, protein, genomic, and multimodal settings. By cataloging textual Sci-LLMs alongside Mol-, Prot-, and Gene-LLMs, and detailing multimodal models with their datasets and benchmarks, the paper exposes current capabilities and critical gaps, especially in data scale, evaluation, and the integration of 3D structure and external knowledge. The discussion highlights practical implications for drug discovery, genomics, and molecular design, and proposes concrete directions (larger cross-modal datasets, 3D structural tokens, tool-enabled reasoning, and robust evaluation) to advance the AI-for-Science agenda while addressing ethical considerations. Together, these insights provide a foundational reference for researchers building and applying Sci-LLMs in biology and chemistry.

Abstract

Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

Paper Structure (55 sections, 9 figures, 10 tables)

This paper contains 55 sections, 9 figures, 10 tables.

Introduction
Background
Formulation of Scientific Languages
Taxonomy of Model Architectures
Pre-training and Fine-tuning
Notions and Terms
Textual Scientific Large Language Models
Models
Datasets
Evaluation
Summary
Molecular Large Language Models
Models
Encoder-only Models
...and 40 more sections

Figures (9)

Figure 1: Illustration of general LLMs struggling to handle scientific languages; the example shows molecules, RNA, and amino acid sequences.
Figure 2: Research scopes of Scientific Large Language Models (Sci-LLMs) in this survey. We focus on scientific languages (i.e., textual, molecular, protein and genomic languages), as well as their combination (i.e., multimodal language), within the realm of biochemical science.
Figure 3: Illustration of molecular, protein and genomic languages. Molecular languages include SMILES, SELFIES and InChI sequences, as well as 2D topology and 3D geometry structures. Protein languages consist of primary structure (i.e., amino acid sequence) and secondary, tertiary, and quaternary structures (3D). Genomic languages include DNA and RNA sequences/structures. This survey focuses solely on sequence modeling of molecular, protein and genomic languages.
Figure 4: Illustration of common architectures of Sci-LLMs, including (a) encoder-only, (b) decoder-only, and (c) encoder-decoder models, with (d) representing the tokens of scientific languages.
Figure 5: Chapter overview of Text-Sci-LLMs.
...and 4 more figures
