From Generalist to Specialist: A Survey of Large Language Models for Chemistry

Yang Han, Ziping Wan, Lu Chen, Kai Yu, Xin Chen

TL;DR

This survey maps the landscape of chemistry-oriented large language models, diagnosing three core bottlenecks: limited domain knowledge, fragmented multi-modal data, and insufficient tool integration. It categorizes approaches along domain knowledge pathways—continued pre-training, supervised fine-tuning, and RLHF—and details multi-modal strategies for 1D sequences, 2D graphs, 3D structures, and other modalities, alongside tool-based grounding with retrieval, ML models, and embodied robots. The authors discuss existing benchmarks and outline concrete future directions, emphasizing data diversity, explicit chain-of-thought reasoning, multi-modal alignment, and autonomous experimentation to advance chemistry AI. Overall, the work offers a structured roadmap for developing chemistry-specific LLMs that can assist researchers, ground outputs in real-world chemistry data, and potentially automate parts of the experimental workflow, thereby accelerating discovery.

Abstract

Large Language Models (LLMs) have significantly transformed our daily life and established a new paradigm in natural language processing (NLP). However, the predominant pretraining of LLMs on extensive web-based texts remains insufficient for advanced scientific discovery, particularly in chemistry. The scarcity of specialized chemistry data, coupled with the complexity of multi-modal data such as 2D graphs, 3D structures and spectra, presents distinct challenges. Although several studies have reviewed Pretrained Language Models (PLMs) in chemistry, there is a conspicuous absence of a systematic survey specifically focused on chemistry-oriented LLMs. In this paper, we outline methodologies for incorporating domain-specific chemistry knowledge and multi-modal information into LLMs. We also conceptualize chemistry LLMs as agents using chemistry tools and investigate their potential to accelerate scientific research. Additionally, we summarize the existing benchmarks for evaluating the chemistry ability of LLMs. Finally, we critically examine the current challenges and identify promising directions for future research. Through this comprehensive survey, we aim to assist researchers in staying at the forefront of developments in chemistry LLMs and to inspire innovative applications in the field.

Paper Structure

This paper contains 33 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Three common errors in general LLMs arising from the key challenges.
  • Figure 2: Taxonomy of current approaches for transferring general LLMs to specialized chemistry LLMs.
  • Figure 3: The compound $C_{8}H_{11}NO$, for example, can be represented across various modalities. 1D sequences include SMILES, IUPAC names, and so on. Molecular structures consist of 2D graphs and 3D structures: 2D graphs encompass three matrices (atomic features, atom connections, and chemical bond features), while 3D structures comprise the coordinates of every atom. Other modalities include mass spectra, images, and so on.
  • Figure 4: The compositional structure of representative SFT datasets. The tasks above the horizontal lines are defined in Table \ref{tab:sft-task}; the source and size of each task are indicated below the horizontal lines, and the percentages on the pie charts show the differences between the datasets.
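
The modalities named in the Figure 3 caption can be made concrete with a small sketch. The example below is an illustration only, not the paper's code: it uses ethanol (SMILES `CCO`, heavy atoms only) rather than the $C_{8}H_{11}NO$ compound from the figure, the single-column feature layout is an assumption, and the 3D coordinates are arbitrary placeholders rather than a real geometry.

```python
# Illustrative sketch: ethanol, heavy atoms only (hydrogens omitted).
# This is NOT the C8H11NO compound shown in Figure 3.

# 1D sequence modality: a SMILES string.
smiles = "CCO"

# Hand-written molecule description (assumed layout, for illustration).
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]  # (atom i, atom j, bond order)

ATOMIC_NUMBER = {"C": 6, "O": 8}

# 2D graph, matrix 1: atomic features, one row per atom.
# Here each row holds only the atomic number.
atom_features = [[ATOMIC_NUMBER[a]] for a in atoms]

# 2D graph, matrices 2 and 3: atom connections and bond features.
n = len(atoms)
adjacency = [[0] * n for _ in range(n)]
bond_features = [[0] * n for _ in range(n)]
for i, j, order in bonds:
    adjacency[i][j] = adjacency[j][i] = 1            # bonds are undirected
    bond_features[i][j] = bond_features[j][i] = order

# 3D structure: one (x, y, z) tuple per atom.
# Arbitrary placeholder values, not an optimized geometry.
coords_3d = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.3, 0.0)]

print(adjacency)
```

In practice a cheminformatics toolkit would derive these matrices from the SMILES string; they are written by hand here so the sketch stays self-contained.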