SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan

Abstract

Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.

Paper Structure (53 sections, 13 figures, 8 tables)

Figures (13)

  • Figure 1: The Faithfulness-Realism Dilemma in scientific data synthesis and our proposed solution. Existing approaches face an inherent trade-off: simplifying context ensures faithfulness but lacks real-world complexity, while generating directly from full documents ensures realism but risks hallucination. We resolve this by decoupling the objectives into a two-stage synthesize-and-reground framework. By first generating verified QA pairs on atomic contexts and subsequently re-embedding them into full-document tasks, we achieve a dataset that simultaneously satisfies Scale, Faithfulness, and Realism.
  • Figure 2: Overview of the synthesize-and-reground framework. The pipeline operates in two stages. Claim-Centric QA Synthesis ensures faithfulness by extracting atomic claims and employing backward reasoning to generate QA pairs with chain-of-thought. Document-Scale Regrounding ensures realism by re-embedding these pairs into full-document contexts and injecting information localization steps to create hard training instances.
  • Figure 3: TQA generation prompt. This prompt generates questions testing deep understanding of scientific content without visual evidence.
  • Figure 4: LLM judge prompt. This prompt evaluates model responses based on text citation (0.30), image citation (0.30), and answer accuracy (0.40).
  • Figure 5: Claim extraction prompt. This prompt guides the LLM to distill paragraphs into structured, verifiable claims serving as blueprints for QA generation.
  • ...and 8 more figures
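
The Figure 4 caption lists three weighted criteria for the LLM judge: text citation (0.30), image citation (0.30), and answer accuracy (0.40). A minimal sketch of how such weights might combine into one score is shown below. Only the three weights come from the caption. The weighted-sum aggregation, the [0, 1] range for each sub-score, and the names `WEIGHTS` and `judge_score` are assumptions made for illustration, not the paper's implementation.

```python
from typing import Mapping

# Weights as stated in the Figure 4 caption.
WEIGHTS = {
    "text_citation": 0.30,
    "image_citation": 0.30,
    "answer_accuracy": 0.40,
}


def judge_score(sub_scores: Mapping[str, float]) -> float:
    """Combine per-criterion scores into one weighted score.

    Assumption: each sub-score is already normalized to [0, 1] and the
    final score is a plain weighted sum (hypothetical aggregation rule).
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        if name not in sub_scores:
            raise ValueError(f"missing sub-score: {name}")
        value = sub_scores[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return total


if __name__ == "__main__":
    example = {"text_citation": 1.0, "image_citation": 0.0, "answer_accuracy": 0.5}
    print(f"weighted score: {judge_score(example):.2f}")
```

Because the weights sum to 1, a response that is perfect on all three criteria scores 1.0 under this sketch, and answer accuracy alone can contribute at most 0.40.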