A Review of Large Language Models and Autonomous Agents in Chemistry

Mayk Caldas Ramos, Christopher J. Collison, Andrew D. White

TL;DR

This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry.

Abstract

Large language models (LLMs) have emerged as powerful tools in chemistry, significantly impacting molecule design, property prediction, and synthesis optimization. This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning. As agents are an emerging topic, we extend the scope of our review of agents beyond chemistry and discuss them across scientific domains. This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry. Key challenges include data quality and integration, model interpretability, and the need for standard benchmarks, while future directions point towards more sophisticated multi-modal agents and enhanced collaboration between agents and experimental methods. Given the rapid pace of this field, a repository has been built to keep track of the latest studies: https://github.com/ur-whitelab/LLMs-in-science.

Paper Structure (60 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: AI-powered LLMs accelerate chemical discovery with models that address key challenges in Property Prediction, Property Directed Molecule Generation, and Synthesis Prediction. Autonomous agents connect these models and additional tools, thereby enabling rapid exploration of vast chemical spaces.
  • Figure 2: a) The generalized encoder-decoder transformer: The encoder on the left converts an input into a vector, while the decoder on the right predicts the next token in a sequence. b) Encoder-decoder transformers are traditionally used for translation tasks and, in chemistry, for reaction prediction, translating reactants into products. c) Encoder-only transformers provide a vector output and are typically used for sentiment analysis. In chemistry, they are used for property prediction or classification tasks. d) Decoder-only transformers generate likely next tokens in a sequence. In chemistry, they are used to generate new molecules given an instruction and description of molecules.
  • Figure 3: Classification of LLMs in chemistry and biochemistry according to their application.
  • Figure 4: Illustration of how Large Language Models (LLMs) evolved chronologically. The dates display the first publication of each model.
  • Figure 5: Number of training tokens (on log scale) available from various chemical sources compared with typical LLM training runs. The numbers are drawn from ZINC [irwin2012zinc], PubChem [kim2016pubchem], [touvron2023llama], ChEMBL [Gaulton2012-og], and [Kinney2023-fj].
  • ...and 1 more figure