VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering

Yuyi Li, Daoyuan Chen, Zhen Wang, Yutong Lu, Yaliang Li

TL;DR

The paper introduces VeriSciQA, a large-scale, auto-verified SVQA dataset created with a Generate-then-Verify framework that grounds QA generation in figure-associated paragraphs. By applying cascaded text- and vision-based verification, it mitigates common LVLM-generated errors and ensures visual grounding and contextual relevance. The dataset spans diverse domains, figure types, and question styles, and experiments show significant improvements when VeriSciQA data is used for fine-tuning open-source models, with gains scaling with data size. Ablation studies demonstrate the critical role of cross-modal verification. Collectively, the work establishes a scalable pathway to improve open-source SVQA through verified data synthesis grounded in scholarly context.

Abstract

Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck lies in the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs' inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context, then applies cross-modal consistency checks against figures along with auxiliary filters to eliminate erroneous pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types. VeriSciQA poses a challenging benchmark for open-source models, with a substantial accuracy gap between the leading open-source models (64%) and a proprietary model (82%). Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size and surpass those of models trained on existing datasets. Human evaluation further validates the superior correctness of VeriSciQA. Together, this evidence demonstrates that continued data expansion with our scalable framework can further advance SVQA capability in the open-source community.
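
The generate-then-filter control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `QAPair`, `generate_then_verify`, `toy_generate`, `mentions_visual_element`, and `toy_figure_check` are hypothetical, and the toy functions are simple string checks standing in for the model calls a real pipeline would make. The sketch only shows the shape of the idea: candidates are generated from text context, then kept only if they pass every check in a cascade.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    figure_id: str


# Hypothetical interfaces; in a real pipeline each would wrap a model call.
Generator = Callable[[str, str], List[QAPair]]  # (figure_id, context) -> candidates
Check = Callable[[QAPair], bool]                # True means the pair passes


def generate_then_verify(
    items: Iterable[Tuple[str, str]],
    generate: Generator,
    checks: List[Check],
) -> List[QAPair]:
    """Generate candidates from text context, keep those passing every check."""
    kept = []
    for figure_id, context in items:
        for pair in generate(figure_id, context):
            # all() short-circuits, so later checks only run on survivors
            if all(check(pair) for check in checks):
                kept.append(pair)
    return kept


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_generate(figure_id: str, context: str) -> List[QAPair]:
    # One candidate per sentence of the figure-citing paragraph.
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    return [QAPair(f"Is it true that {s}?", "yes", figure_id) for s in sentences]


def mentions_visual_element(pair: QAPair) -> bool:
    # Text-side filter: keep only questions that name something drawable.
    return "curve" in pair.question or "bar" in pair.question


def toy_figure_check(pair: QAPair) -> bool:
    # Figure-side filter: placeholder for a consistency check against the image.
    return pair.figure_id != "fig_unreadable"


items = [
    ("fig1", "The red curve peaks at epoch 10. We thank the reviewers."),
    ("fig_unreadable", "The blue bar is tallest."),
]
verified = generate_then_verify(
    items, toy_generate, [mentions_visual_element, toy_figure_check]
)
```

Ordering the checks from cheapest to most expensive means a costly figure-based check runs only on candidates that already survived the text-based ones, which is one plausible reason to cascade filters rather than run them independently.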

Paper Structure

This paper contains 51 sections, 4 equations, 22 figures, 10 tables.

Figures (22)

  • Figure 1: Illustrative examples of the four common error categories (E1–E4).
  • Figure 2: Overview of our Generate-then-Verify framework as instantiated to curate VeriSciQA. We first extract figure-citing paragraphs from arXiv papers as figure-associated context. Generation stage then decomposes each context into atomic claims and produces QA pairs only from claims with concrete visual grounding; Verification stage applies cascaded text-based and vision-based filtering to remove erroneous candidates.
  • Figure 3: Dataset Statistics. Dataset composition across three dimensions: (a) scientific domain distribution across 20 arXiv categories, (b) figure type coverage spanning 12 categories, and (c) question type distribution across five question types. Long-tail categories are grouped as "Others"; complete breakdowns are provided in Appendix \ref{app:taxonomy}.
  • Figure 4: Dataset Difficulty Assessment. Model accuracy across (a) scientific domains, (b) figure types, and (c) question types. Even the strongest model (o3) achieves only 77% on Astro-Ph, 80% on Composite figures, and 79% on Compositional questions, revealing substantial challenges. Complete results are in Appendix \ref{app:detailed-results}.
  • Figure 5: Performance gains when fine-tuning Qwen2.5-VL-7B on increasing amounts of our dataset.
  • ...and 17 more figures