SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images

Ching Ting Leung, Yufan Chen, Hanyu Gao

TL;DR

SMiCRM addresses the limited benchmarking data for optical chemical structure recognition on arrow-pushing mechanistic diagrams by providing 453 annotated images with mechanistic arrows. The dataset is curated from existing reaction-image collections, rendered as molecular graphs via ASKCOS, and annotated with canonical SMILES and RDKit-generated SDFs, with FAIR deposition on Zenodo under CC-BY 4.0. Preliminary benchmarking demonstrates that state-of-the-art OCSR models experience reduced accuracy on mechanistic images compared to standard datasets, underscoring the need for improved recognition methods. Overall, SMiCRM enables standardized evaluation of mechanistic molecular image understanding and aims to advance machine reading of complex chemical diagrams.

Abstract

Optical chemical structure recognition (OCSR) systems aim to extract molecular structure information, usually as a molecular graph or a SMILES string, from images of chemical molecules. Although many tools have been developed for this purpose, challenges remain because of the various types of noise that can appear in the images. Specifically, we focus on 'arrow-pushing' diagrams, a common type of chemical image used to show electron flow in mechanistic steps. We present Structural molecular identifier of Molecular images in Chemical Reaction Mechanisms (SMiCRM), a dataset designed to benchmark machine recognition of chemical molecules with arrow-pushing annotations. Comprising 453 images, it spans a broad array of organic chemical reactions, each illustrated with molecular structures and mechanistic arrows. SMiCRM offers a rich collection of annotated molecule images for benchmarking OCSR methods. The dataset includes a machine-readable molecular identity for each image as well as mechanistic arrows showing electron flow during chemical reactions. It presents a more realistic and challenging task for testing molecular recognition technologies, and solving this task can greatly enrich the mechanistic information in computer-extracted chemical reaction data.
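
As a rough illustration of how OCSR output might be scored against the machine-readable identities described above, the following is a minimal sketch of an exact-match accuracy computation over SMILES strings. The metric choice, the function name `exact_match_accuracy`, and the `canonicalize` parameter are assumptions made for illustration; they are not taken from the paper, which may use a different evaluation protocol.

```python
from typing import Callable, Optional, Sequence


def exact_match_accuracy(
    predicted: Sequence[Optional[str]],
    reference: Sequence[str],
    canonicalize: Callable[[str], Optional[str]] = str.strip,
) -> float:
    """Fraction of images whose predicted SMILES matches the reference.

    `canonicalize` is a hypothetical hook: it maps a SMILES string to a
    canonical form, or returns None if the string cannot be parsed.
    The default only strips whitespace, so it compares raw strings.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0

    hits = 0
    for pred, ref in zip(predicted, reference):
        if pred is None:
            # The model produced no output for this image: count as a miss.
            continue
        canon_pred = canonicalize(pred)
        canon_ref = canonicalize(ref)
        if canon_pred is not None and canon_pred == canon_ref:
            hits += 1
    return hits / len(reference)
```

Raw string comparison undercounts correct predictions, because one molecule can be written as many different SMILES strings. In practice one would likely pass a real canonicalizer, for example a small wrapper around RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles` that returns None when parsing fails; that wrapper is left out here to keep the sketch dependency-free.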

Paper Structure (5 sections, 3 figures, 1 table)

This paper contains 5 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: A typical noisy molecule image from a chemical reaction mechanism. It contains four types of contamination: intramolecular curved arrows, intermolecular curved arrows, partial charges, and reaction arrows.
  • Figure 2: Only the highlighted molecules are documented for this reaction. The second molecule is an aliphatic molecule containing more functional groups, and the fourth is an aromatic molecule containing more functional groups; both are chosen because they carry curved arrows. The first and the fourth molecules are relatively simple in structure, and the last molecule has no curved arrows.
  • Figure 3: Demonstration of extraction of molecular information from images. Identities of abbreviations are revealed, curved arrows are ignored, and partial charges are included in the molecule's identity.