MolGrapher: Graph-based Visual Recognition of Chemical Structures

Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valery Weber, Ingmar Meijer, Peter Staar, Fisher Yu

TL;DR

MolGrapher addresses Optical Chemical Structure Recognition (OCSR) with a graph-based pipeline: it localizes atom keypoints, constructs a supergraph of atom and bond candidates, and classifies the nodes with a Graph Neural Network. A synthetic data generation pipeline and the USPTO-30K benchmark support training and evaluation, and the model generalizes to diverse drawing styles and large molecules without fine-tuning on real data. Across multiple benchmarks, MolGrapher outperforms rule-based and several deep-learning baselines, and it is resilient to image perturbations and captions. The work advances scalable, accurate OCSR and paves the way for large-scale digital molecule databases, while outlining future extensions to complex structures such as Markush representations.

Abstract

The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures depicting molecular structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diversity of drawing styles, and the need for training data. In this work, we introduce MolGrapher to recognize chemical structures visually. First, a deep keypoint detector detects the atoms. Second, we treat all candidate atoms and bonds as nodes and put them in a graph. This construct allows a natural graph representation of the molecule. Last, we classify atom and bond nodes in the graph with a Graph Neural Network. To address the lack of real training data, we propose a synthetic data generation pipeline producing diverse and realistic results. In addition, we introduce a large-scale benchmark of annotated real molecule images, USPTO-30K, to spur research on this critical topic. Extensive experiments on five datasets show that our approach significantly outperforms classical and learning-based methods in most settings. Code, models, and datasets are available.
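
The second step of the abstract, where atom and bond candidates are both treated as graph nodes, can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_supergraph`, the dictionary layout of the nodes, the midpoint placement of bond nodes, and the hand-written candidate pairs are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def build_supergraph(
    keypoints: List[Point],
    candidate_pairs: List[Tuple[int, int]],
) -> Tuple[List[Dict], List[Tuple[int, int]]]:
    """Return (nodes, edges) in which atoms and bond candidates are both nodes.

    Hypothetical layout: nodes 0..n-1 are the detected atoms. Each bond
    candidate gets its own node, placed at the midpoint of the two atoms it
    would connect, and is linked to both of those atoms by an edge.
    """
    nodes: List[Dict] = [{"kind": "atom", "pos": p} for p in keypoints]
    edges: List[Tuple[int, int]] = []
    for i, j in candidate_pairs:
        (x1, y1), (x2, y2) = keypoints[i], keypoints[j]
        bond_id = len(nodes)
        nodes.append(
            {"kind": "bond", "pos": ((x1 + x2) / 2, (y1 + y2) / 2), "ends": (i, j)}
        )
        # Atom -> bond node -> atom, so a bond is a node rather than an edge.
        edges.append((i, bond_id))
        edges.append((bond_id, j))
    return nodes, edges


# Three detected keypoints in a row and two bond candidates between neighbours.
nodes, edges = build_supergraph(
    [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)],
    [(0, 1), (1, 2)],
)
```

With this representation, a node classifier can label atom nodes and bond nodes in the same pass, which matches the abstract's description of classifying "atom and bond nodes in the graph". How rejected bond candidates are labeled is not stated here, so the sketch leaves the class set open.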

Paper Structure (31 sections, 5 equations, 16 figures, 11 tables)

Figures (16)

  • Figure 1: MolGrapher extracts the chemical structure, including all atoms and bonds, from a molecule image in a document. Our approach constructs a supergraph of the molecule (bottom right), containing all detected atom and bond candidates. These nodes are then classified by a Graph Neural Network in order to retrieve the chemical structure.
  • Figure 2: Molecular graph recognition architecture. We illustrate the architecture of MolGrapher, a graph-based network for Optical Chemical Structure Recognition. The keypoint detector (red) locates atom nodes in the molecule. A supergraph containing atom and bond candidates is constructed (green). Atoms and bonds are classified using a Graph Neural Network (blue).
  • Figure 3: Supergraph construction. The figure presents the construction of bond proposals for an atom denoted $A$. Considered bonds are depicted with dashed lines. Green bonds are accepted in the supergraph, while red bonds are discarded because: (1) there are no filled pixels around their centerpoints or (2) they are obstructed by other keypoints.
  • Figure 4: Prediction steps. Keypoints are detected and then used to build a supergraph. After classifying the nodes of the graph and recognizing abbreviated groups, the output molecule is created. In this example, the polycyclic molecule contains overlapping bonds, a challenging feature for OCSR models, and is still correctly recognized.
  • Figure 5: Qualitative comparison. The figure shows examples of predictions for characteristic images from different benchmarks. Compared to previous rule-based and learning-based methods, our approach robustly recognizes the exact molecular structure in challenging cases, such as with distracting captions, stereochemistry, and very large molecules.
  • ...and 11 more figures
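
The two rejection rules named in the Figure 3 caption (no filled pixels around a proposal's centerpoint, or obstruction by another keypoint) can be sketched as a simple filter over keypoint pairs. This is an illustrative reading of the caption, not the paper's code: the helper names, the square window of radius 1, the distance tolerance `tol`, and the binary list-of-lists image format are all assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def has_ink_near(image: List[List[int]], center: Point, radius: int = 1) -> bool:
    """True if any filled (non-zero) pixel lies in a square window around center."""
    cx, cy = int(round(center[0])), int(round(center[1]))
    for y in range(max(0, cy - radius), min(len(image), cy + radius + 1)):
        for x in range(max(0, cx - radius), min(len(image[0]), cx + radius + 1)):
            if image[y][x]:
                return True
    return False


def is_obstructed(a: Point, b: Point, others: List[Point], tol: float = 1.5) -> bool:
    """True if another keypoint lies within tol pixels of the segment a-b."""
    ax, ay = a
    bx, by = b
    length_sq = (bx - ax) ** 2 + (by - ay) ** 2
    if length_sq == 0:
        return False  # Degenerate segment: nothing to obstruct.
    for px, py in others:
        # Project the keypoint onto the segment and clamp to its endpoints.
        t = ((px - ax) * (bx - ax) + (py - ay) * (by - ay)) / length_sq
        t = max(0.0, min(1.0, t))
        dist = math.hypot(px - (ax + t * (bx - ax)), py - (ay + t * (by - ay)))
        if dist < tol:
            return True
    return False


def propose_bonds(image: List[List[int]], keypoints: List[Point]) -> List[Tuple[int, int]]:
    """Keep keypoint pairs that pass both tests from the Figure 3 caption."""
    accepted = []
    for i in range(len(keypoints)):
        for j in range(i + 1, len(keypoints)):
            a, b = keypoints[i], keypoints[j]
            mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
            others = [p for k, p in enumerate(keypoints) if k not in (i, j)]
            if has_ink_near(image, mid) and not is_obstructed(a, b, others):
                accepted.append((i, j))
    return accepted


# Toy image: one horizontal stroke on row 6 of a 9 x 21 binary grid.
image = [[1 if y == 6 else 0 for x in range(21)] for y in range(9)]
# Three keypoints on the stroke and one isolated keypoint above it.
keypoints = [(0.0, 6.0), (10.0, 6.0), (20.0, 6.0), (10.0, 0.0)]
bonds = propose_bonds(image, keypoints)
```

In this toy case only the pairs (0, 1) and (1, 2) survive. The pair (0, 2) is dropped because keypoint 1 lies on the segment, and every pair involving the isolated keypoint is dropped because no filled pixels surround its midpoint.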