MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning

Wenrui Zhang, Xinggang Wang, Bin Feng, Wenyu Liu

TL;DR

MolSight tackles Optical Chemical Structure Recognition with a focus on stereochemistry by introducing a three-stage learning framework: SMILES-based pretraining on large-scale noisy data, multi-granularity fine-tuning with chemical bond and coordinate heads, and reinforcement learning on stereo-focused data to optimize semantic correctness. The approach achieves state-of-the-art performance on both real and synthetic benchmarks, with notable gains in stereochemical recognition and robust transfer to molecular property tasks. These advances enhance automated chemical data extraction for drug discovery and large-scale cheminformatics applications. The work also provides a new stereo-focused dataset and demonstrates the practical impact of combining perception, structured auxiliary tasks, and RL for complex image-to-structure translation.

Abstract

Optical Chemical Structure Recognition (OCSR) plays a pivotal role in modern chemical informatics, enabling the automated conversion of chemical structure images from scientific literature, patents, and educational materials into machine-readable molecular representations. This capability is essential for large-scale chemical data mining, drug discovery pipelines, and Large Language Model (LLM) applications in related domains. However, existing OCSR systems face significant challenges in accurately recognizing stereochemical information due to the subtle visual cues that distinguish stereoisomers, such as wedge and dash bonds, ring conformations, and spatial arrangements. To address these challenges, we propose MolSight, a comprehensive learning framework for OCSR that employs a three-stage training paradigm. In the first stage, we conduct pre-training on large-scale but noisy datasets to endow the model with fundamental perception capabilities for chemical structure images. In the second stage, we perform multi-granularity fine-tuning using datasets with richer supervisory signals, systematically exploring how auxiliary tasks, specifically chemical bond classification and atom localization, contribute to molecular formula recognition. Finally, we employ reinforcement learning for post-training optimization and introduce a novel stereochemical structure dataset. Remarkably, we find that even with MolSight's relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm can further enhance the model's performance on molecules with stereochemistry. Through extensive experiments across diverse datasets, our results demonstrate that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.
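
The abstract names GRPO but does not spell out its update in this excerpt. As a rough illustration of the "group relative" idea only, the sketch below normalizes the rewards of several completions sampled for the same image against the mean and standard deviation of that group. The function name, the reward values, and the choice of population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to the other completions in its group.

    Assumption: population standard deviation with a small eps for stability;
    real implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four SMILES completions sampled from one image.
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)

# Completions above the group mean get a positive advantage,
# those below it get a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the baseline is the group's own mean, no separate value network is needed, which may be part of why the approach remains practical for a compact model.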

Paper Structure

This paper contains 35 sections, 7 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Examples of challenging chemical structure images. (a) Diversity of images from different sources. (b) 3D molecular information encoded within 2D images.
  • Figure 2: Examples of how SMILES-M expresses Markush structures. Our SMILES-M can handle all types of Markush structures, including type changes, location changes, and frequency changes.
  • Figure 3: Overall pipeline of MolSight. Given a chemical structure image, the image encoder extracts and fuses multi-level image features, which are fed into the SMILES decoder together with the previous SMILES tokens to predict the next SMILES token. The SMILES head maps the output logits into the SMILES vocabulary space, while the bond head and the coord head predict the chemical bond type and the location of each atom token, respectively. Residual connections are omitted in this figure.
  • Figure 4: Comparison of training paradigms. (a) Imitation learning: because many different SMILES strings can be correct for the same molecule, token-level optimization may give an ambiguous optimization direction. (b) Reinforcement learning: multiple completions are sampled at once, then scored on Tanimoto similarity and structural consistency. We use GRPO to achieve this trajectory-level optimization.
  • Figure 5: Reward curve during training. The total reward rises steadily throughout post-training. Left: training with the weighted reward function shown in Alg. 1. Middle: training with the Tanimoto similarity reward alone. Right: training with the stereochemistry reward alone.
  • ...and 2 more figures
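
The captions of Figures 4 and 5 mention scoring sampled completions with a Tanimoto similarity reward and a stereochemistry reward, combined by weights in Alg. 1. The sketch below shows one way such a score could be assembled, assuming fingerprints are represented as sets of on-bit indices. The function names, the boolean stereo check, and the 0.5/0.5 weights are hypothetical, since Alg. 1 and its weights are not reproduced in this excerpt.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def combined_reward(pred_bits, true_bits, stereo_match, w_sim=0.5, w_stereo=0.5):
    """Weighted sum of a similarity term and a stereochemistry term.

    Assumption: the stereo term is a simple 0/1 flag and the weights are
    placeholders; the paper's Alg. 1 may define both differently.
    """
    stereo_term = 1.0 if stereo_match else 0.0
    return w_sim * tanimoto(pred_bits, true_bits) + w_stereo * stereo_term


# Hypothetical fingerprints: 3 shared bits out of 5 distinct bits overall.
true_bits = {1, 4, 7, 9}
pred_bits = {1, 4, 7, 12}

similarity = tanimoto(pred_bits, true_bits)
reward_wrong_stereo = combined_reward(pred_bits, true_bits, stereo_match=False)
reward_right_stereo = combined_reward(pred_bits, true_bits, stereo_match=True)
print(similarity, reward_wrong_stereo, reward_right_stereo)
```

Scoring the whole predicted structure in this way rewards any valid SMILES spelling of the right molecule equally, which is the contrast with token-level imitation drawn in Figure 4.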