A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Pengfei Jiang, Cheng Tang, Ziyan Huang, Jiyao Liu, Jiaqi Wei, Yuejin Yang, Xiang Zhang, Guangshuai Wang, Yue Yang, Huihui Xu, Ziyang Chen, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Dingkang Yang, Jinjie Wei, Jiaqi Wang, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Hongze Zhu, Yu Liu, Fudi Wang, Yiqing Shen, Yuanfeng Ji, Yanzhou Su, Tong Xie, Hongming Shan, Chun-Mei Feng, Zhi Hou, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Benyou Wang, Yuewen Cao, Minjie Shen, Jie Xu, Haodong Duan, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Zhongying Deng, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Wenqi Shao, Yihao Liu, Siqi Luo, Yi Xin, Xiaohong Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Siqi Sun, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Yirong Chen, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou

TL;DR

This survey reframes Scientific LLMs (Sci-LLMs) as a data-centric, co-evolutionary system where model capabilities are inseparable from the underlying data substrate. It introduces a unified taxonomy of scientific data and a hierarchical knowledge model to address multimodal, cross-scale, and domain-specific challenges, and analyzes hundreds of pre-/post-training datasets and benchmarks to reveal data-centric bottlenecks. The work surveys six scientific domains, contrasts general-purpose and domain-specific Sci-LLMs, and discusses paradigm shifts toward scientific agents and data ecosystems that enable autonomous experimentation and closed-loop knowledge updating. It also identifies persistent data-quality, representation, and governance issues and proposes a roadmap for integrated data architectures, automated standardization, and continuous evaluation to enable trustworthy, evolving AI partners in scientific discovery.

Abstract

Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.

Paper Structure

This paper contains 124 sections, 29 figures, 3 tables.

Figures (29)

  • Figure 1: The song of humanity is a song of courage. The diagram depicts the continuum of scientific inquiry spanning from subatomic particles through atomic and molecular structures, cellular and organismal biology, ecological systems, planetary sciences, to cosmological phenomena. Each tier represents distinct yet interconnected domains of investigation, illustrating the nested hierarchy of natural phenomena and the corresponding disciplinary frameworks employed in their study. This visualization encapsulates the expansion of scientific understanding from micro to macro dimensions, symbolizing humanity’s persistent pursuit of knowledge across all scales of nature.
  • Figure 2: Cumulative trend of publications on major preprint platforms whose titles or abstracts mention the keyword "language model" or the combination "language model + scientific domain" (e.g., chemistry, physics, multi-omics, medicine, etc.). Left: Results from January 2018 to August 2025, from arXiv and PubMed. For arXiv, the matching includes "language model" in combination with additional science-related keywords; PubMed results are limited to occurrences in titles and abstracts. Both platforms show rapid growth. Right: Results from 2020 to August 2025, from bioRxiv, medRxiv, and ChemRxiv, all based on direct matches of "language model" in titles and abstracts. While the overall volumes are smaller than arXiv and PubMed, all three platforms, especially bioRxiv, show rapid acceleration, reflecting growing interdisciplinary interest in large language models across biomedical, chemical, and computational sciences.
  • Figure 3: The evolution of Sci-LLMs from 2018 to 2025 reveals four paradigm shifts: (1) transfer learning approaches, (2) the scaling era, marked by knowledge integration in larger models, (3) instruction-following capabilities that enable flexible task adaptation, and (4) the latest paradigm, scientific agents: AI systems capable of autonomously conducting scientific research, from hypothesis generation and experimental design to data analysis and discovery. Note: Model positions reflect their release dates (x-axis) rather than strict paradigm classification. The four paradigms represent evolving trends in Sci-LLM development with overlaps and continuities, not mutually exclusive categories.
  • Figure 4: Six main scientific domains covered in this survey. The figure illustrates the primary disciplines investigated in our study on science-oriented large language models, encompassing Chemistry, Materials Science, Physics, Life Sciences, Astronomy, and Earth Science, along with representative subfields within each domain.
  • Figure 5: Examples of visual data across typical medical imaging modalities, involving radiology (PET, CT, mammography, X-ray, MRI, and ultrasound), dermatology, ophthalmology (CFP, FFA, UWF-SLO, and OCT), endoscopy, histopathology, and cellular microscopy. The figure is sourced from open-source medical datasets.
  • ...and 24 more figures