
SciClaimEval: Cross-modal Claim Verification in Scientific Papers

Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Tian Cheng Xia, Florian Boudin, Andre Greiner-Petter, Akiko Aizawa

TL;DR

SciClaimEval addresses the need for realistic, cross-modal scientific claim verification by compiling an authentic dataset of 1,664 claim–evidence pairs from 180 papers across ML, NLP, and medicine. It introduces evidence modification as a strategy for generating negative examples and provides the table and figure evidence in several representations, including LaTeX, HTML, JSON, and PNG. Benchmarking 11 multimodal foundation models shows that figure-based verification is notably challenging and lags well behind human performance, while table-based verification is more tractable for open-source models. The dataset supports evaluation with macro-F1 and the stricter Pair Accuracy metric, exposing the remaining gaps in trustworthy multimodal scientific reasoning. Overall, SciClaimEval combines realism with multimodal diversity to promote progress in automated scientific claim verification.
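The two metrics named above can be illustrated with a minimal sketch. The summary does not define Pair Accuracy, so the definition used here is an assumption: a claim's pair (its original evidence and its modified evidence) counts as correct only when every member of the pair is predicted correctly. The label names, the `pair_ids` grouping, and the toy predictions are all hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical label names; the dataset's actual label strings may differ.
LABELS = ("supported", "refuted")


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pair_accuracy(pair_ids, gold, pred):
    """Fraction of pairs in which every member is predicted correctly.

    ASSUMPTION: this is one plausible reading of "Pair Accuracy",
    not a definition confirmed by the summary.
    """
    all_correct = defaultdict(lambda: True)
    for pid, g, p in zip(pair_ids, gold, pred):
        all_correct[pid] = all_correct[pid] and (g == p)
    return sum(all_correct.values()) / len(all_correct)


# Toy example: two claims, each seen with original and modified evidence.
pair_ids = ["c1", "c1", "c2", "c2"]
gold = ["supported", "refuted", "supported", "refuted"]
pred = ["supported", "supported", "supported", "refuted"]

print(f"macro-F1:      {macro_f1(gold, pred):.3f}")
print(f"pair accuracy: {pair_accuracy(pair_ids, gold, pred):.3f}")
```

Under this reading, a model that always answers "supported" gets half of the individual samples right but scores zero on Pair Accuracy, which is why the pairwise metric is the stricter of the two.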

Abstract

We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables), rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains, machine learning, natural language processing, and medicine, validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification remains particularly challenging for all models, as a substantial performance gap remains between the best system and human baseline.

Paper Structure

This paper contains 35 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Our dataset construction pipeline consists of three main steps: data preparation, automatic claim-evidence extraction, and human annotation (Subsections \ref{subsec_data_prepare}, \ref{subsec_claim_extract}, and \ref{subsec_human_annotate}). The human annotation process involves two tasks: claim-evidence verification and evidence modification. After collecting all samples, we design two subtasks in our dataset: claim-label prediction and claim-evidence prediction. Details are in Subsection \ref{subsec_task_design}.
  • Figure 2: Analyses of evidence-modifying operations in SciClaimEval.
  • Figure 3: An example of an "others" modification to figure evidence in our dataset. Adding spurious data points to the plot makes the claim unsupported. The annotator labels this operation as "others" and records the specific detail as "adding fake data points." The context provides the information needed to understand the plot in the evidence.
  • Figure 4: Violin plots showing the distribution of the Structural Similarity Index (SSIM; Wang et al., 2004) across five operation types. In each plot, the black dashed line indicates the group mean, and the numeric label to the right of the line denotes the corresponding average value.
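
To make the similarity score in Figure 4 concrete, here is a minimal sketch of the Structural Similarity Index computed between an original and a modified image. It is a simplification under stated assumptions: the statistics are taken over the whole image in one window, whereas the standard formulation averages the score over local windows, and the paper's actual computation is not described in this summary. The function name `global_ssim` and the toy pixel values are hypothetical.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equal-length grayscale pixel sequences.

    SIMPLIFICATION: uses global means, variances, and covariance rather than
    the locally windowed average of the standard SSIM formulation.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants with the commonly used K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy "images" flattened to pixel lists; one value is altered in the copy.
original = [10, 50, 90, 130, 170, 210]
modified = [10, 50, 90, 250, 170, 210]

print(global_ssim(original, original))  # identical inputs score 1 (up to rounding)
print(global_ssim(original, modified))  # the altered copy scores below 1
```

A score near 1 means the modified evidence is visually close to the original, so lower values in the violin plots correspond to more visible edits.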