From Words to Molecules: A Survey of Large Language Models in Chemistry

Chang Liao, Yemin Yu, Yu Mei, Ying Wei

TL;DR

The paper surveys how to translate chemical knowledge into the language-model framework, detailing the molecular representations, tokenization methods, and pretraining objectives used to train LLMs for chemistry. It introduces a three-way taxonomy of input domains (single-domain, multi-domain, multi-modal) and three pretraining objectives (MLM, MPP, ATG), plus cross-modal strategies, to enable tasks from property prediction to de novo design. By cataloging applications (chatbots, in-context learners, and representation learners), the work clarifies the practical roles LLMs can play in chemistry and identifies gaps in knowledge integration, continual learning, and interpretability. This synthesis provides a roadmap for building more capable, reliable chemical LLMs with broader impact in synthesis, discovery, and education.

Abstract

In recent years, Large Language Models (LLMs) have achieved significant success in natural language processing (NLP) and various interdisciplinary areas. However, applying LLMs to chemistry is a complex task that requires specialized domain knowledge. This paper provides a thorough exploration of the nuanced methodologies employed in integrating LLMs into the field of chemistry, delving into the complexities and innovations at this interdisciplinary juncture. Specifically, our analysis begins with examining how molecular information is fed into LLMs through various representation and tokenization methods. We then categorize chemical LLMs into three distinct groups based on the domain and modality of their input data, and discuss approaches for integrating these inputs for LLMs. Furthermore, this paper delves into the pretraining objectives with adaptations to chemical LLMs. After that, we explore the diverse applications of LLMs in chemistry, including novel paradigms for their application in chemistry tasks. Finally, we identify promising research directions, including further integration with chemical knowledge, advancements in continual learning, and improvements in model interpretability, paving the way for groundbreaking developments in the field.

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: LLMs for Chemistry: Applications and Paradigms
  • Figure 2: An overview of topics in this paper, with dashed lines indicating their applicability to various downstream tasks.
  • Figure 3: An Example of Tokenized Output from Different Tokenizers for the Sequence "NC(=O)COc1ccc(Br)cc1"
  • Figure 4: Language Modelling Objectives
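
To make the tokenizer comparison named in Figure 3 more concrete, the sketch below splits the same SMILES string, "NC(=O)COc1ccc(Br)cc1", in two ways: character by character, and with a simple atom-level regular expression that keeps two-letter elements such as Br together. The function names and the regular expression are illustrative assumptions, not the tokenizers shown in the figure or used by any particular model in the survey.

```python
import re

# Hypothetical atom-level pattern (an assumption for illustration):
# bracket atoms, the two-letter halogens Br and Cl, two-digit ring
# closures such as %12, and otherwise any single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def char_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into single characters."""
    return list(smiles)


def atom_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens using ATOM_PATTERN."""
    return ATOM_PATTERN.findall(smiles)


smiles = "NC(=O)COc1ccc(Br)cc1"
char_tokens = char_tokenize(smiles)
atom_tokens = atom_tokenize(smiles)

# Character-level splitting breaks bromine into "B" and "r";
# atom-level splitting keeps it as the single token "Br".
print(char_tokens)
print(atom_tokens)
```

The character-level split treats "B" and "r" as unrelated symbols, whereas the atom-level split preserves "Br" as one chemically meaningful unit, which is the kind of difference a tokenizer comparison like Figure 3 is meant to highlight.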