MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures

Lucas Morin, Valéry Weber, Ahmed Nassar, Gerhard Ingmar Meijer, Luc Van Gool, Yawei Li, Peter Staar

TL;DR

MarkushGrapher addresses the challenge of jointly recognizing Markush structure backbones and their variable substituents in patent documents by fusing a Vision-Text-Layout encoder with an Optical Chemical Structure Recognition encoder to autoregressively generate a CXSMILES backbone and a substituent table. A synthetic data generation pipeline paired with a real-world annotated benchmark (M2S) enables robust learning, while results on synthetic, M2S, and USPTO-Markush datasets show state-of-the-art performance against chemistry-specific and general vision-language models. The work introduces the M2S dataset and demonstrates the model’s ability to handle complex multi-modal Markush features, including R-groups, frequency, and position variation indicators. This approach enables scalable extraction and searchability of Markush structures from patent literature, with potential to power large-scale Markush databases for prior-art search and landscape analysis; code, models, and datasets are to be released to the community.

Abstract

The automated analysis of chemical literature holds promise to accelerate discovery in fields such as materials science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advances have been made in the automatic extraction of chemical structures from text and images, yet Markush structures remain largely unexplored due to their complex multi-modal nature. In this work, we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available.
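
To make the merging step concrete, here is a minimal sketch of how two encoder outputs could be combined before decoding. The paper's Figure 2 caption states that the two outputs are concatenated; the axis is not specified there, so concatenation along the sequence axis is an assumption. The function name `fuse_encodings`, the toy embedding width, and the values are hypothetical and only for illustration.

```python
from typing import List

Embedding = List[float]


def fuse_encodings(vtl_tokens: List[Embedding],
                   ocsr_tokens: List[Embedding]) -> List[Embedding]:
    """Concatenate two token-embedding sequences along the sequence axis.

    Assumption: both encoders emit embeddings of the same width, so a
    decoder can attend over one combined sequence.
    """
    if vtl_tokens and ocsr_tokens and len(vtl_tokens[0]) != len(ocsr_tokens[0]):
        raise ValueError("both encoders must emit embeddings of the same width")
    return vtl_tokens + ocsr_tokens


# Toy values: 3 hypothetical VTL tokens and 2 hypothetical OCSR tokens, width 2.
vtl_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ocsr_tokens = [[0.7, 0.8], [0.9, 1.0]]

fused = fuse_encodings(vtl_tokens, ocsr_tokens)
print(len(fused), len(fused[0]))
```

In this sketch the decoder would see a sequence whose length is the sum of the two input lengths, while the embedding width stays unchanged.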

Paper Structure

This paper contains 32 sections, 3 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: MarkushGrapher extracts Markush structures from documents using their visual and textual definitions.
  • Figure 2: Markush structure recognition architecture. MarkushGrapher jointly encodes the input image and its text with a VTL encoder (blue) and an OCSR encoder (red). The VTL output ($e_{1}$) and the OCSR output ($e_{2}$) are concatenated. Finally, this joint encoding is processed with a text decoder to predict a sequential representation of the Markush backbone (purple) and its substituent table (orange).
  • Figure 3: Optimized CXSMILES format. The figure presents the steps of the CXSMILES optimization. The CXSMILES (1) is first compacted by moving variable groups in the SMILES sequence and removing unnecessary characters (2). Then, the indices of atoms are appended after each atom (between $<i>$ and $</i>$ tokens) and the sequence is encoded using a specific vocabulary for atoms and bonds ($<chem>$ tokens) (3).
  • Figure 4: Synthetic training data generation. The figure presents the pipeline for generating synthetic training samples. First, a molecule is sampled from PubChem and augmented to create a CXSMILES. Second, the CXSMILES is used to jointly generate an image of the Markush backbone and its OCR cells (red), and to generate an image of a text description and its OCR cells (green). Finally, the images are collated to create a training sample.
  • Figure 5: Qualitative comparison. Examples of predictions are shown for the different MMSR models on real-world data (M2S, USPTO-Markush, SciAssess) and on synthetic data (MarkushGrapher-Synthetic).
  • ...and 8 more figures
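
The index-appending step described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' tokenizer: the regular expression below covers only a simplified subset of SMILES atoms (bracket atoms, `Cl`, `Br`, the wildcard `*`, and common single-letter atoms), indices are assumed to start at 0, and the function name `add_atom_indices` is hypothetical. It is meant to be applied to the SMILES part of a CXSMILES string only, not to the extension block.

```python
import re

# Simplified atom pattern (an assumption, not the paper's vocabulary):
# bracket atoms, two-letter halogens, wildcard, then single-letter atoms.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|\*|[BCNOPSFI]|[bcnops]")


def add_atom_indices(smiles: str) -> str:
    """Append <i>k</i> after the k-th atom, leaving bonds, branches,
    and ring-closure digits untouched."""
    pieces = []
    pos = 0
    for idx, match in enumerate(ATOM.finditer(smiles)):
        pieces.append(smiles[pos:match.start()])  # non-atom characters
        pieces.append(f"{match.group()}<i>{idx}</i>")
        pos = match.end()
    pieces.append(smiles[pos:])  # trailing non-atom characters
    return "".join(pieces)


print(add_atom_indices("CC(=O)Cl"))
# C<i>0</i>C<i>1</i>(=O<i>2</i>)Cl<i>3</i>
```

Making atom indices explicit in the sequence presumably lets other parts of the output refer to atoms by number without the model having to count them, which is one plausible reading of why the caption describes this step.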