ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining

Zhiyuan Liu, Yaorui Shi, An Zhang, Sihang Li, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua

TL;DR

ReactXT tackles the gap in reaction-text modeling by introducing reaction-contextualized pretraining with forward/backward reaction and random-molecule contexts, coupled with a balanced sampling strategy and a MolCA-based multi-modal LM backbone. It pairs this pretraining with the OpenExp dataset to evaluate experimental-procedure prediction, molecule captioning, and retrosynthesis, achieving state-of-the-art performance on procedure prediction and notable gains in molecule captioning while remaining competitive in retrosynthesis. The approach demonstrates how textual descriptions and molecular structures can be jointly leveraged to improve reaction understanding, enabling more reliable text-based interfaces for chemical synthesis tasks. The work contributes a new open benchmark and a scalable pretraining paradigm that can impact automated synthesis planning and molecule-aware natural language interfaces.

Abstract

Molecule-text modeling, which aims to facilitate molecule-relevant tasks with a textual interface and textual knowledge, is an emerging research direction. Beyond single molecules, studying reaction-text modeling holds promise for helping the synthesis of new materials and drugs. However, previous works mostly neglect reaction-text modeling: they primarily focus on modeling individual molecule-text pairs or learning chemical reactions without texts in context. Additionally, one key task of reaction-text modeling -- experimental procedure prediction -- is less explored due to the absence of an open-source dataset. The task is to predict step-by-step actions of conducting chemical experiments and is crucial to automating chemical synthesis. To resolve the challenges above, we propose a new pretraining method, ReactXT, for reaction-text modeling, and a new dataset, OpenExp, for experimental procedure prediction. Specifically, ReactXT features three types of input contexts to incrementally pretrain LMs. Each of the three input contexts corresponds to a pretraining task to improve the text-based understanding of either reactions or single molecules. ReactXT demonstrates consistent improvements in experimental procedure prediction and molecule captioning and offers competitive results in retrosynthesis. Our code is available at https://github.com/syr-cn/ReactXT.
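
The three input contexts named in the TL;DR (forward reaction, backward reaction, random molecules) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the `Molecule` class, the function names, the prompt layout, and in particular the reading of "forward" as reactants followed by products and "backward" as products followed by reactants. The SMILES strings are those of the reaction shown in Figure 1; the `DESC_j` descriptions are placeholders.

```python
import random
from dataclasses import dataclass


@dataclass
class Molecule:
    """Hypothetical container: one molecule and its textual description."""
    smiles: str
    description: str


def format_molecule(mol):
    # Assumed prompt layout; the source does not specify the real format.
    return f"SMILES: {mol.smiles}\nDescription: {mol.description}"


def forward_context(reactants, products):
    # Assumption: forward context lists reactants first, products last.
    return list(reactants) + list(products)


def backward_context(reactants, products):
    # Assumption: backward context lists products first, reactants last.
    return list(products) + list(reactants)


def random_context(pool, k, rng):
    # Molecules drawn without regard to any reaction.
    return rng.sample(pool, k)


def build_prompt(molecules):
    # Concatenate the molecule entries into one text context.
    return "\n\n".join(format_molecule(m) for m in molecules)


if __name__ == "__main__":
    reactants = [
        Molecule("COC(OC)N(C)C", "DESC_1"),
        Molecule("CCC(=O)CC(=O)OC", "DESC_2"),
    ]
    products = [Molecule("CCC(=O)/C(=C/N(C)C)C(=O)OC", "DESC_3")]
    print(build_prompt(forward_context(reactants, products)))
```

In this sketch the three contexts differ only in which molecules are selected and in what order, so a single prompt builder serves all three.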

Paper Structure (25 sections, 1 equation, 7 figures, 29 tables)

Figures (7)

  • Figure 1: Comparison of molecule-text generative modeling methods. Orange arrows denote the chemical relations for generation. 2D graph embeddings (MolCA) are omitted here for simplicity, but are added in the final framework for improved performance. $\mathtt{DESC}_j$ denotes the description of the $j$-th molecule. The chemical reaction in subfigures (b) and (d) is: COC(OC)N(C)C + CCC(=O)CC(=O)OC $\rightarrow$ CCC(=O)/C(=C/N(C)C)C(=O)OC.
  • Figure 2: Illustration of the experimental procedure prediction task and its dataset curation process. We employ the actions defined by smiles2actions and the description-to-action model from TextChemT5.
  • Figure 3: Illustration of Reaction-Contextualized Molecule-Text Pretraining. Example uses forward reaction context.
  • Figure 4: Distribution of molecules in the pretraining chemical reactions. The after-adjustment distribution is obtained by weighted sampling of chemical reactions, with the sample matching the size of the pretraining dataset.
  • Figure 5: Human evaluations on OpenExp.
  • ...and 2 more figures
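
The weighted sampling mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the excerpt does not give the weighting formula, so the sketch uses mean inverse molecule frequency as one plausible choice, and the names `reaction_weights` and `balanced_sample` are hypothetical. Each reaction is represented simply as a list of the SMILES strings it involves.

```python
import random
from collections import Counter


def reaction_weights(reactions):
    """Weight each reaction by the mean inverse frequency of its molecules.

    Assumption: this formula is illustrative only; the source does not
    state how ReactXT computes its sampling weights.
    """
    # Count in how many reactions each molecule appears.
    counts = Counter(m for rxn in reactions for m in set(rxn))
    weights = []
    for rxn in reactions:
        mols = set(rxn)
        weights.append(sum(1.0 / counts[m] for m in mols) / len(mols))
    return weights


def balanced_sample(reactions, seed=0):
    """Draw a weighted sample (with replacement) the same size as the input."""
    rng = random.Random(seed)
    return rng.choices(
        reactions, weights=reaction_weights(reactions), k=len(reactions)
    )


if __name__ == "__main__":
    reactions = [
        ["O", "CCO"],   # contains a frequent molecule ("O")
        ["O", "CCN"],   # contains the same frequent molecule
        ["CC", "CN"],   # only rare molecules
    ]
    print(reaction_weights(reactions))
    print(balanced_sample(reactions))
```

Reactions made up of rarely seen molecules receive larger weights, so the resampled set is less dominated by a handful of very common molecules, while its size stays equal to that of the original dataset as the caption describes.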