Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

Jiahe Song, Chuang Wang, Yinfan Wang, Hao Zheng, Rui Nie, Bowen Jiang, Xingjian Wei, Junyuan Gao, Yubin Wang, Bin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He

Abstract

Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.

Paper Structure (45 sections, 1 equation, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Overview of the Identifier as Visual Prompting (IdtVP) strategy. Left: VLMs learn visual-text alignment from identifier-rich interleaved literature during pre-training. Middle: IdtVP inputs diagrams annotated with molecule identifiers to activate this pre-trained knowledge. Right: The model outputs structured reactions, directly using these identifiers as "handles" to concisely represent complex molecules.
  • Figure 2: Attention heatmaps of different prompting strategies. The top panels illustrate the input images, textual prompts, and corresponding output formats for each strategy. The bottom panels visualize the Text-to-Image attention heatmaps, computed by averaging the attention weights across all heads per layer, followed by average pooling across all layers. Notably, IdtVP yields far more precise and comprehensive visual grounding on molecules and text compared to BROS and BIVP.
  • Figure 3: Overview of the proposed framework. Phase 1 constructs IdtVP data, which supports both zero-shot inference and model training. Phases 2-3 introduce a generalized optimization paradigm (SFT followed by Re$^3$-DAPO) that is transferable to other prompting variants (e.g., BIVP).
  • Figure 4: Change in reward ($\Delta$) on the validation set. The Y-axis denotes the reward improvement relative to the initial step (Step 0).
  • Figure 5: Overview of the Cross-Modal Verification pipeline. The system processes dual-stream inputs: visual parsing of raw diagrams via RxnID (top) and textual extraction from the manuscript via Idt-TE (bottom), enabling downstream applications such as Precision Refinement (Case A) and Contextual Enrichment (Case B).
  • ...and 5 more figures
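
The heatmap aggregation described in the Figure 2 caption (mean over heads within each layer, then average pooling over layers) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the nested-list layout `attn[layer][head][text_token][patch]`, the extra averaging over text tokens, and the reshape to a `grid_h x grid_w` patch grid are all assumptions made for the example.

```python
def mean_over(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def text_to_image_heatmap(attn, grid_h, grid_w):
    """Collapse attn[layer][head][text_token][patch] into a 2-D heatmap.

    Hypothetical helper: the layout and the text-token averaging are
    assumptions, only the head/layer averaging comes from the caption.
    """
    per_layer = []
    for layer in attn:
        # Assumption: collapse the text-token axis first, so each head
        # contributes one vector of per-patch attention weights.
        head_maps = [mean_over(head) for head in layer]
        # Step 1 (caption): average across all heads within this layer.
        per_layer.append(mean_over(head_maps))
    # Step 2 (caption): average pooling across all layers.
    pooled = mean_over(per_layer)
    if len(pooled) != grid_h * grid_w:
        raise ValueError("patch count does not match the requested grid")
    # Reshape the flat patch vector into image-grid rows.
    return [pooled[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy input: 2 layers, 2 heads, 1 text token, 4 image patches.
demo = [
    [[[1.0, 0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]]],
    [[[0.0, 0.0, 1.0, 0.0]], [[0.0, 0.0, 0.0, 1.0]]],
]
heatmap = text_to_image_heatmap(demo, grid_h=2, grid_w=2)
print(heatmap)  # [[0.25, 0.25], [0.25, 0.25]]
```

Because every step is a plain mean over equally sized axes, the order of the averages does not change the result. With a real VLM, the weights would come from the model's attention outputs restricted to text-query and image-key positions, which this sketch does not cover.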