Towards Efficient Large Language Models for Scientific Text: A Review
Huy Quoc To, Ming Liu, Guangyan Huang
TL;DR
This paper surveys efficient large language model (LLM) approaches for scientific text, addressing the substantial resource and accessibility barriers that accompany large-scale models. It frames efficiency as a dual strategy: model-centric methods (e.g., parameter-efficient fine-tuning, adapters, and distillation) and data-centric approaches (e.g., data collection and selection, active learning), examining their applicability across diverse scientific domains. The review covers domain-specific advances in biology, biomedicine, clinical research, mathematics, geoscience, chemistry, ocean science, and cross-disciplinary models, highlighting concrete techniques and models such as LoRA, BioKnowPrompt, WizardMath, K2, TextEdge, OCEANGPT, DARWIN, and SciGLM. It identifies core challenges (data labeling, data quality, multi-LLM integration, catastrophic forgetting, multimodal integration, and cost) and proposes actionable directions to enable broader, sustainable adoption of scientific LLMs.
Abstract
Large language models (LLMs) have ushered in a new era for processing complex information in various fields, including science. The growing body of scientific literature allows these models to acquire and understand scientific knowledge effectively, improving their performance across a wide range of tasks. This power comes at a price: LLMs require extremely expensive computational resources, large amounts of data, and long training times. In recent years, researchers have therefore proposed various methodologies to make scientific LLMs more affordable. The best-known approaches fall into two directions: focusing on the size of the models or enhancing the quality of the data. To date, a comprehensive review of these two families of methods has not been undertaken. In this paper, we (I) summarize current advances in turning the emerging abilities of LLMs into more accessible AI solutions for science, and (II) investigate the challenges and opportunities of developing affordable solutions for scientific domains using LLMs.
