Language models in molecular discovery

Nikita Janakarajan; Tim Erdmann; Sarath Swaminathan; Teodoro Laino; Jannis Born

TL;DR

This review covers the role of language models in molecular discovery, underlining their strengths in de novo drug design, property prediction and reaction chemistry, and highlights valuable open-source software assets that lower the entry barrier to the field of scientific language modeling.

Abstract

The success of language models, especially transformer-based architectures, has carried over into other domains, giving rise to "scientific language models" that operate on small molecules, proteins or polymers. In chemistry, language models contribute to accelerating the molecule discovery cycle, as evidenced by promising recent findings in early-stage drug discovery. Here, we review the role of language models in molecular discovery, underlining their strengths in de novo drug design, property prediction and reaction chemistry. We highlight valuable open-source software assets, thereby lowering the entry barrier to the field of scientific language modeling. Finally, we sketch a vision for future molecular design that combines a chatbot interface with access to computational chemistry tools. Our contribution serves as a valuable resource for researchers, chemists, and AI enthusiasts interested in understanding how language models can and will be used to accelerate chemical discovery.

Paper Structure (21 sections, 5 figures)

Introduction
Accelerated molecular discovery
Molecule Representation
Simplified Molecular Input Line-Entry System (SMILES)
Self Referencing Embedded Strings (SELFIES)
International Chemical Identifier (InChI)
Generative Modelling
Recurrent Neural Network (RNN)
Variational Autoencoder (VAE)
Transformer
Property Prediction
Software tools for scientific language modeling
Natural language models
GT4SD -- Generative modeling toolkits
RXN for Chemistry: Reaction and synthesis language models
...and 6 more sections

Figures (5)

Figure 1: A comparison of molecular discovery workflows: (a) Classic approach, where each hypothesis (a.k.a. molecule) requires a new experimental cycle. (b) Accelerated molecular discovery cycle with machine-generated hypotheses and assisted validation, enabling simultaneous generation and testing of numerous molecules.
Figure 2: An illustration of popular ways of representing a chemical molecule as input to an ML model. The representations may be (a) string-based, such as SMILES, SELFIES, or InChI, which use characters to represent different aspects of a molecule, (b) structure-based, such as graphs or MolFiles, which encode connectivity and atomic position, and (c) feature-based, such as Morgan fingerprints, which encode local substructures as bits.
Figure 3: An illustration of conditional molecule generation using LMs. The process initiates with the collection and processing of multi-modal data, which is then compressed into a fixed-size latent representation. These representations are subsequently passed to a molecular generative model. The generated molecules then undergo in-silico property prediction, which is linked back to the generative model through a feedback loop during training. The in-silico models direct the generative model to produce property- or task-driven molecules using a reward function. In the inference stage, candidate molecules generated by the optimized model continue through the workflow for lab synthesis and subsequent experimental validation to determine their efficacy for the desired task.
Figure 4: Screenshot of the LLM-powered chatbot application ChemChat. Embedding the capabilities of existing resources such as PubChem [kim2019pubchem], RDKit [landrum2013rdkit] or GT4SD [manica2023accelerating] enables the assistant to execute programming routines in the background and thus answer highly subject-matter-specific user requests without the user needing programming skills.
Figure 5: Screenshot of the LLM-powered chatbot application ChemChat showing the continuation of the conversation involving generative tasks through GT4SD's Regression Transformer [born2023regression] as well as property [ertl2009estimation] and similarity calculation [tanimoto1957ibm, rogers2010extended].
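
The similarity calculation mentioned for Figure 5 is commonly done with the Tanimoto coefficient over bit fingerprints such as those in Figure 2(c). The following is a minimal sketch of that coefficient in plain Python, not the ChemChat or GT4SD implementation. It assumes a fingerprint is given as the set of its on-bit indices; the function name `tanimoto` and the example bit indices are hypothetical and were not computed from real molecules.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices.

    Computed as |A & B| / (|A| + |B| - |A & B|), i.e. shared bits over the union.
    """
    if not fp_a and not fp_b:
        # Assumed convention for this sketch: two empty fingerprints count as identical.
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit indices for two molecules (illustrative only).
fp_query = {3, 17, 42, 128, 511}
fp_candidate = {3, 17, 99, 128}

# 3 shared bits out of 6 distinct bits in total.
print(tanimoto(fp_query, fp_candidate))  # 0.5
```

In practice the bit sets would come from a cheminformatics toolkit rather than being written by hand; the set-based form here only makes the arithmetic visible.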
