Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents

Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, Hengxing Cai

TL;DR

The paper tackles automated SAR extraction from heterogeneous scientific documents, a task hampered by rigid rule-based methods and brittle end-to-end LLMs. It introduces DocSAR-200, a carefully annotated 200-document benchmark, and Doc2SAR, a modular framework that combines domain-specific tools (OCSR, molecular coreference) with supervised fine-tuning to solve complex sub-tasks. Experiments show that Doc2SAR achieves a state-of-the-art Table Recall of $80.78\%$ on DocSAR-200, far above end-to-end baselines (e.g., $29.30\%$ for GPT-4o), and can process over 100 PDFs per hour. A web app supports visualization and refinement of the results. The work demonstrates robust cross-page and multilingual SAR extraction, offering a practical pipeline to accelerate drug discovery and materials research.

Abstract

Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT). Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art performance across various document types, significantly outperforming leading end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of 80.78% on DocSAR-200, exceeding end-to-end GPT-4o by 51.48 percentage points. Furthermore, Doc2SAR demonstrates practical usability through efficient inference and is accompanied by a web app.
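
The reported gap is easiest to read as an absolute difference in Table Recall. The short Python sketch below redoes that arithmetic from the two figures quoted in this summary (80.78% for Doc2SAR, 29.30% for end-to-end GPT-4o). The function name `recall_gap` is a hypothetical helper introduced only for illustration; it is not part of the paper.

```python
# Table Recall values (%) as quoted in the TL;DR and abstract.
doc2sar_recall = 80.78
gpt4o_recall = 29.30


def recall_gap(ours: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, ratio ours / baseline).

    Hypothetical helper for illustration only.
    """
    return round(ours - baseline, 2), round(ours / baseline, 2)


gap_points, ratio = recall_gap(doc2sar_recall, gpt4o_recall)
print(f"absolute gap: {gap_points} percentage points, ratio: {ratio}x")
```

The absolute gap reproduces the 51.48 figure in the abstract, which confirms that it is a difference in percentage points rather than a relative improvement.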

Paper Structure

This paper contains 34 sections, 6 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Motivation for a modular framework: An end-to-end GPT-4o model shows compounding errors across crucial stages of SAR extraction.
  • Figure 2: A representative ground truth annotation from the DocSAR-200 benchmark, showing molecular structures linked to activity table entries across multiple pages, formatted as structured CSV records.
  • Figure 3: Overview of DocSAR-200. (a) File-level composition broken down by patents (blue) and literature (red), with further subdivision across journals and patent offices. (b) Distribution of activity table types within 2617 tables; the gray region indicates the proportion of irrelevant (non-activity) tables. (c) Molecular size distribution measured by atom count, including typical and very large molecules.
  • Figure 4: The Doc2SAR framework. The PDF is parsed into layout segments; molecular images and tables are processed in parallel. The framework's strength lies in its synergistic design, using specialized tools like OCSR and fine-tuned MLLMs for perception, and rule-based methods for logical integration.
  • Figure 5: Case study from ACS Medicinal Chemistry Letters (10.1021_acsmedchemlett.3c00392). The red and blue boxes (masks) are visualization aids only and are not part of the original document. Doc2SAR correctly performs intra-page association, linking a molecular structure to its activity table on the same page.
  • ...and 5 more figures
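
The data flow described in the Figure 4 caption (layout segments, parallel processing of molecular images and tables, then rule-based integration) can be pictured with a minimal Python sketch. This is not the authors' implementation: `Segment`, `recognize_molecule`, `parse_table`, and `extract_sar` are hypothetical names, the two perception functions are stubs standing in for an OCSR tool and a fine-tuned MLLM, joining on a compound label is an assumed integration rule, and the input values are toy data.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Segment:
    """A layout segment from a parsed PDF (hypothetical structure)."""
    kind: str      # "molecule" or "table"
    page: int
    payload: dict  # stand-in for a cropped image or table content


def recognize_molecule(seg: Segment) -> tuple[str, str]:
    # Stub for OCSR plus molecular coreference: returns (compound label, SMILES).
    return seg.payload["label"], seg.payload["smiles"]


def parse_table(seg: Segment) -> list[tuple[str, str]]:
    # Stub for a fine-tuned MLLM table reader: returns (label, activity) rows.
    return seg.payload["rows"]


def extract_sar(segments: list[Segment]) -> list[dict]:
    mols = [s for s in segments if s.kind == "molecule"]
    tables = [s for s in segments if s.kind == "table"]
    # Molecule and table branches run in parallel, as in the caption.
    with ThreadPoolExecutor() as pool:
        mol_future = pool.submit(lambda: [recognize_molecule(s) for s in mols])
        tab_future = pool.submit(lambda: [parse_table(s) for s in tables])
        label_to_smiles = dict(mol_future.result())
        rows = [row for table in tab_future.result() for row in table]
    # Assumed rule-based integration: join on the compound label, across pages.
    return [
        {"compound": label, "smiles": label_to_smiles[label], "activity": activity}
        for label, activity in rows
        if label in label_to_smiles
    ]


# Toy input: a structure on page 1 and an activity table on page 2.
segments = [
    Segment("molecule", 1, {"label": "1a", "smiles": "CCO"}),
    Segment("table", 2, {"rows": [("1a", "IC50 = 12 nM"), ("9z", "IC50 = 5 nM")]}),
]
records = extract_sar(segments)
print(records)
```

In this sketch the row for the label `9z` is dropped because no structure with that label was recognized. A real pipeline would need a policy for such unmatched rows.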