A Survey of Pre-trained Language Models for Processing Scientific Text
Xanh Ho, Anh Khoa Duong Nguyen, An Tuan Dao, Junfeng Jiang, Yuki Chida, Kaito Sugimoto, Huy Quoc To, Florian Boudin, Akiko Aizawa
TL;DR
This survey catalogs pre-trained language models for processing scientific text (SciLMs) released between 2019 and 2023, analyzing over 110 models across biomedical, chemical, multi-domain, and other scientific domains. It identifies encoder-based architectures as the dominant paradigm, with the biomedical literature driving both data availability and model proliferation, while multilingual and larger-scale models are emerging. The paper assesses effectiveness on the most common tasks and datasets, finding that performance has improved over time but that the assessment is limited by narrow task coverage and evaluation benchmarks. It highlights challenges in multilingual expansion, knowledge integration, and multi-modal capabilities, and argues for foundation SciLMs and broader, standardized evaluation to accelerate progress in scientific NLP. These findings can guide researchers toward more generalizable, trustworthy, and scalable SciLM development and evaluation practices.
Abstract
The number of Language Models (LMs) dedicated to processing scientific text is on the rise. Keeping pace with the rapid growth of scientific LMs (SciLMs) has become a daunting task for researchers. To date, no comprehensive surveys on SciLMs have been undertaken, leaving this issue unaddressed. Given the constant stream of new SciLMs, the state of the art and how the models compare to each other remain largely unknown. This work fills that gap and provides a comprehensive review of SciLMs, including an extensive analysis of their effectiveness across different domains, tasks, and datasets, and a discussion of the challenges that lie ahead.
