A Survey of Pre-trained Language Models for Processing Scientific Text

Xanh Ho, Anh Khoa Duong Nguyen, An Tuan Dao, Junfeng Jiang, Yuki Chida, Kaito Sugimoto, Huy Quoc To, Florian Boudin, Akiko Aizawa

TL;DR

This survey catalogs Pre-trained Language Models for Processing Scientific Text (SciLMs) across 2019–2023, analyzing over 110 models in biomedical, chemical, multi-domain, and other scientific domains. It identifies encoder-based architectures as the dominant paradigm and biomedical literature as the main driver of data availability and model proliferation, while multilingual and larger-scale models are emerging. The paper assesses effectiveness on the top tasks and datasets, revealing improvements over time but also limitations due to narrow task coverage and evaluation benchmarks. It highlights challenges in multilingual expansion, knowledge integration, and multi-modal capabilities, and argues for foundation SciLMs and broader, standardized evaluation to accelerate robust scientific NLP progress. The findings can guide researchers toward more generalizable, trustworthy, and scalable SciLM development and evaluation practices.

Abstract

The number of Language Models (LMs) dedicated to processing scientific text is on the rise. Keeping pace with the rapid growth of scientific LMs (SciLMs) has become a daunting task for researchers. To date, no comprehensive survey of SciLMs has been undertaken, leaving this issue unaddressed. Given the constant stream of new SciLMs, the state of the art and how the models compare to each other remain largely unknown. This work fills that gap and provides a comprehensive review of SciLMs, including an extensive analysis of their effectiveness across different domains, tasks, and datasets, and a discussion of the challenges that lie ahead.

Paper Structure (56 sections, 13 figures, 15 tables)

Figures (13)

  • Figure 1: Overall structure of our survey.
  • Figure 2: Evolutionary tree of SciLMs. Nodes are color-coded by domain: blue for biomedical, pink for chemical, yellow for multi-domain, green for other domains, and gray for general-domain models. A node is filled in white if the model is closed-source; otherwise, the model is open-source. If a model covers multiple languages, the English version is used; if it has multiple variants, the most efficient variant is used. SciLMs that use continual pretraining are shown as children of the model whose weights they are initialized from. For clarity, only popular models are depicted as parent nodes. SciLMs trained from scratch are placed as leaves in the rightmost branch.
  • Figure 3: Existing tasks in processing scientific text.
  • Figure 4: Distribution of model sizes.
  • Figure 5: Average performance changes in the NER task. Scores are averaged over five NER datasets: NCBI-disease, BC5CDR-disease, JNLPBA, BC5CDR-chemical, and BC2GM.
  • ...and 8 more figures