Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model

Yufan Chen, Ching Ting Leung, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao

TL;DR

RxnIM introduces a first-of-its-kind multimodal LLM designed to parse chemical reaction images into machine-readable data. Trained in three stages on a large synthetic dataset and refined with real images, it jointly performs reaction component identification and reaction condition interpretation, achieving an average F1 of 88% and outperforming existing methods by about 5%. The system is deployed via a Gradio web app (RxnIM.web) and released with open-source data and code to facilitate automated construction of large, machine-readable reaction databases. This work enables scalable, AI-ready curation of chemistry literature data and broadens the use of multimodal models in image-based cheminformatics.

Abstract

Artificial intelligence (AI) has demonstrated significant promise in advancing organic chemistry research; however, its effectiveness depends on the availability of high-quality chemical reaction data. Currently, most published chemical reactions are not available in machine-readable form, limiting the broader application of AI in this field. The extraction of published chemical reactions into structured databases still relies heavily on manual curation, and robust automatic parsing of chemical reaction images into machine-readable data remains a significant challenge. To address this, we introduce the Reaction Image Multimodal large language model (RxnIM), the first multimodal large language model specifically designed to parse chemical reaction images into machine-readable reaction data. RxnIM not only extracts key chemical components from reaction images but also interprets the textual content that describes reaction conditions. Together with a specially designed large-scale dataset generation method to support model training, our approach achieves excellent performance, with an average F1 score of 88% on various benchmarks, surpassing literature methods by 5%. This represents a crucial step toward the automatic construction of large databases of machine-readable reaction data parsed from images in the chemistry literature, providing essential data resources for AI research in chemistry. The source code, model checkpoints, and datasets developed in this work are released under permissive licenses. An instance of the RxnIM web application can be accessed at https://huggingface.co/spaces/CYF200127/RxnIM.

Paper Structure

This paper contains 30 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Dataset generation and overview of the proposed RxnIM. a, Synthetic dataset generation pipeline. We obtain textual reaction information in the Pistachio dataset, generate visual reaction components, and create sub-images based on predefined reaction patterns. These sub-images are combined to form the final synthetic reaction image. This process resulted in the creation of a large-scale chemical reaction image parsing dataset containing 60,000 diverse images. b, Model architecture of our RxnIM. The model incorporates four key components: 1) A unified task instruction for standardizing chemical reaction image parsing tasks, 2) A multimodal encoder that aligns image information with task instructions, 3) A ReactionImg tokenizer to convert image features into tokens, and 4) An open-ended LLM decoder that generates the final output. c, Workflow for chemical reaction image parsing using RxnIM, where results from two tasks are combined and molecular structures are converted into machine-readable formats like SMILES or Molfile.
  • Figure 2: Comparison of model performance on the reaction component identification task on four different patterns of reaction images on the real test dataset. We display precision, recall, and $F_1$ scores in hard match and soft match, of our model and current methods across four patterns of reaction images: Single Line, Multiple Line, Branch, and Cycle. The performance is evaluated to demonstrate the models' capabilities in accurately extracting reactions under varying image complexities and layouts.
  • Figure 3: Visualization examples of the model's prediction on the reaction component identification task compared to the current best method RxnScribe. We display the comparison between RxnScribe and RxnIM on the reaction component identification task across three different prediction examples. Each predicted reaction is visualized in a separate image, showing the predicted reaction components, including reactants, conditions, and products, with color-coded boxes representing different component types. Check marks and cross marks indicate correct and incorrect predictions, respectively, under the hard match criteria. The red dashed circle indicates that the reaction is not predicted. The DOI numbers of the relevant journal articles for these real reaction images can be found in Supplementary Note 3 and Supplementary Table 3.
  • Figure 4: More visualization examples of the model's prediction on the reaction component identification task. We showcase more complex examples of predicted reactions, each visualized in a separate image. Prediction 1 is a multiple-line reaction image with four reactions, Prediction 2 is a branch reaction image with three reactions, and Prediction 3 is a cycle reaction image with nine reactions.
  • Figure 5: Model performance on the reaction condition interpretation task. a, Overall performance on the reaction condition interpretation task in OCR and CRI (Condition Role Identification) accuracy. b, The CRI performance in precision, recall, and $F_1$ scores on five different condition roles: agent, solvent, temperature, time, and yield. c, The confusion matrix detailing the model's performance in correctly identifying these condition roles, highlighting areas of accurate and confused classifications.
  • ...and 4 more figures
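
The precision, recall, and $F_1$ scores reported in the Figure 2 and Figure 5 captions can be illustrated with a minimal sketch. The excerpt does not define the hard and soft match criteria, so the code below assumes, purely for illustration, that a predicted reaction counts as correct only when its reactants, conditions, and products all equal those of a ground-truth reaction. The `Reaction` class, its field names, and `precision_recall_f1` are hypothetical and are not taken from the RxnIM code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Hypothetical record for one parsed reaction (illustration only)."""
    reactants: frozenset
    conditions: frozenset
    products: frozenset


def precision_recall_f1(predicted, gold):
    """Score predictions against ground truth under an assumed exact-match rule.

    A prediction is a true positive only if an identical Reaction
    (same reactants, conditions, and products) appears in the ground truth.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with placeholder component identifiers.
gold = [
    Reaction(frozenset({"A"}), frozenset({"cond1"}), frozenset({"B"})),
    Reaction(frozenset({"B"}), frozenset({"cond2"}), frozenset({"C"})),
]
predicted = [
    Reaction(frozenset({"A"}), frozenset({"cond1"}), frozenset({"B"})),  # exact match
    Reaction(frozenset({"B"}), frozenset({"cond2"}), frozenset({"D"})),  # wrong product
]

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Under this assumed rule, a single wrong component makes the whole reaction incorrect. That is one plausible reading of a strict criterion; a soft criterion would relax the comparison for some components.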