General-Purpose Models for the Chemical Sciences: LLMs and Beyond

Nawaf Alampara, Anagha Aneesh, Martiño Ríos-García, Adrian Mirza, Mara Schilling-Wilhelmi, Ali Asghar Aghajani, Meiling Sun, Gordan Prastalo, Kevin Maik Jablonka

TL;DR

This review analyzes how general-purpose models (GPMs), including large language models, can transform chemistry and materials science by addressing the data diversity, scale, and tacit knowledge intrinsic to the field. It introduces a broad framework encompassing data representations, pre-training, fine-tuning, post-training alignment, and system-level agentic architectures, while detailing multimodal and optimization strategies suited to chemical data. The authors synthesize current state-of-the-art approaches, benchmark limitations, and the practical challenges of deploying GPMs in real labs, emphasizing safety, ethics, and evaluation standards. They argue that while GPMs offer substantial potential for automating workflows, hypothesis generation, and experiment execution, robust validation, standardized evaluation, and responsible governance are essential to realize transformative, reliable, and safe autonomous scientific systems in chemistry and related domains.

Abstract

Data-driven techniques have great potential to transform and accelerate the chemical sciences. However, the chemical sciences also pose the unique challenge of very diverse, small, and fuzzy datasets that are difficult to leverage with conventional machine learning approaches. A new class of models, general-purpose models (GPMs), of which large language models are an example, has shown the ability to solve tasks they were not directly trained on and to operate flexibly with small amounts of data in different formats. In this review, we discuss the fundamental building principles of GPMs and review recent and emerging applications of these models in the chemical sciences across the entire scientific process. While many of these applications are still in the prototype phase, we expect that the increasing interest in GPMs will help many of them mature in the coming years.

Paper Structure

This paper contains 186 sections, 3 equations, 25 figures, 6 tables.

Figures (25)

  • Figure 1: State space description for chemistry at different scales. We illustrate how the number of hidden variables (gray) grows with scale and complexity. For simple systems, we can explicitly write down all variables with their values and describe the system perfectly. For more complex systems, which are closer to practical applications, this is no longer possible: many more variables cannot be explicitly enumerated.
  • Figure 2: Cumulative token count based on the ChemPile tabular datasets (Mirza et al., 2025). We compare the approximate token counts of three datasets: the Llama-3 training dataset (Grattafiori et al., 2024), the openly available chemistry papers in the ChemPile-Paper dataset, and the ChemPile-LIFT dataset. Aggregating the tabular datasets that were converted to text format in the ChemPile-LIFT subset yields a token count of the same order of magnitude as the collection of open chemistry papers. However, without the smaller datasets, we cannot capture the breadth and complexity of chemistry data, which is essential for training GPMs. The tokenization methods for ChemPile and Llama-3 are described in the respective papers.
  • Figure 3: Dataset creation protocols. In "top-down" approaches, we curate a large corpus of data, which can be used to train GPMs. The "bottom-up" approach starts from a problem definition, and the dataset is then collected via literature mining and experiments. Both approaches can use synthetic data to increase the size and diversity of the data.
  • Figure 4: General training workflow through the lens of molecular science. The figure illustrates the progression from pre-training through fine-tuning to post-training stages. (1) Pre-training: The model learns the underlying data distribution from a vast, unlabeled dataset. This is visualized as transforming an unstructured representation space (left, square cloud) into a structured manifold (the Swiss roll). At this stage, the model has learned the "shape" of the data: the fundamental rules that make a molecule chemically valid. However, the representations are not yet specialized for any task. (2) Fine-tuning: The model is trained on specific, labeled tasks, such as predicting solubility (flask icon) and toxicity (skull icon). This process "colors" the manifold, adjusting the learned representations so that their position now also correlates with specific properties (e.g., blue for one property profile, red for another). (3) Post-training Alignment: The model's behavior is biased towards desired outcomes. This is visualized as preferentially sampling from a specific region of the colored manifold, such as generating molecules predicted to have high solubility and low toxicity (right, the brighter red region).
  • Figure 5: Main families in self-supervised learning (SSL). The figure illustrates the two primary approaches, each using a different strategy to generate pseudo-labels from the data itself. Generative Methods (Top Panel): This family focuses on reconstruction and prediction. The model learns representations by generating missing information. The examples shown correspond to the pretext tasks discussed in the text: (1) predicting masks in a graph, analogous to masked modeling (see the section on masked modeling); (2) learning from context, which is the basis for next-token prediction (see the section on next-token prediction); and (3) learning to denoise, where the model reconstructs a clean input from a corrupted version (see the section on denoising). Contrastive Learning (Bottom Panel): This family learns by comparing samples. The model is trained to pull representations of similar samples together while pushing dissimilar ones apart. Examples include: (1) aligning embeddings from different augmentations of the same molecule, a core idea in instance discrimination (see the section on instance discrimination); (2) learning to cluster similar molecules together, as in clustering-based contrastive learning (see the section on clustering-based contrastive learning); and (3) cross-modal alignment, where representations from different data types (e.g., a molecule's graph and its spectral properties) are learned jointly (see the section on contrastive learning).
  • ...and 20 more figures
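
The Figure 5 caption describes generative pretext tasks that derive pseudo-labels from the data itself. As a rough illustration only, the sketch below applies the masked-modeling idea to a character-level tokenization of a SMILES string. The function name `mask_tokens`, the `[MASK]` placeholder, and the character-level tokenization are assumptions made for this example and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def mask_tokens(tokens, mask_fraction=0.3, seed=0):
    """Build one masked-modeling training pair from a token sequence.

    Returns (corrupted, labels): `corrupted` has some tokens replaced by MASK,
    and `labels[i]` holds the original token at masked positions, else None.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one position so every sequence yields a training signal.
    n_mask = max(1, round(mask_fraction * len(tokens)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else None for i, t in enumerate(tokens)]
    return corrupted, labels


# Character-level tokens of the SMILES string for ethanol (an assumed toy input).
tokens = list("CCO")
corrupted, labels = mask_tokens(tokens, mask_fraction=0.34, seed=0)
print(corrupted, labels)
```

A model trained on such pairs would be asked to predict the entries of `labels` that are not `None` from `corrupted`. No human annotation is involved, which is what makes the labels "pseudo-labels".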