Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents
Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, Hengxing Cai
TL;DR
The paper tackles automated SAR extraction from heterogeneous scientific documents, a task for which rule-based methods are too rigid and end-to-end LLMs too brittle. It introduces DocSAR-200, a carefully annotated 200-document benchmark, and Doc2SAR, a modular framework that combines domain-specific tools (OCSR, molecular coreference) with supervised fine-tuning to handle complex sub-tasks. On DocSAR-200, Doc2SAR achieves a state-of-the-art Table Recall of 80.78%, far above end-to-end baselines (e.g., 29.30% for GPT-4o). It can process over 100 PDFs per hour and comes with a web app for visualization and refinement. The work demonstrates robust cross-page and multilingual SAR extraction, offering a practical pipeline to accelerate drug discovery and materials research.
Abstract
Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT). Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art performance across various document types, significantly outperforming leading end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of 80.78% on DocSAR-200, exceeding end-to-end GPT-4o by 51.48 percentage points. Furthermore, Doc2SAR demonstrates practical usability through efficient inference and is accompanied by a web app.
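
The 51.48 figure is an absolute difference between two Table Recall values (80.78% for Doc2SAR and 29.30% for end-to-end GPT-4o, the latter taken from the TL;DR), not a relative improvement. The short Python sketch below shows the arithmetic for both readings. The function name `recall_gap` is hypothetical and used only for illustration; it is not part of Doc2SAR.

```python
def recall_gap(ours_pct: float, baseline_pct: float) -> tuple[float, float]:
    """Compare two recall scores that are both given in percent.

    Returns (absolute gap in percentage points, relative gain in percent).
    """
    absolute_pp = ours_pct - baseline_pct               # difference of the two scores
    relative_pct = 100.0 * absolute_pp / baseline_pct   # gain relative to the baseline
    return absolute_pp, relative_pct


# Scores reported for DocSAR-200: Doc2SAR vs. end-to-end GPT-4o.
absolute_pp, relative_pct = recall_gap(80.78, 29.30)
print(f"absolute gap:  {absolute_pp:.2f} percentage points")
print(f"relative gain: {relative_pct:.1f}%")
```

Read as a relative gain, the same pair of scores corresponds to an increase of roughly 176% over the baseline, which is why the gap is best stated in percentage points.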
