SciClaimHunt: A Large Dataset for Evidence-based Scientific Claim Verification

Sujit Kumar, Anshul Sharma, Siddharth Hemant Khincha, Gargi Shroff, Sanasam Ranbir Singh, Rahul Mishra

TL;DR

Scientific claim verification remains challenging due to domain-specific evidence and numerical precision. This work introduces two large-scale, evidence-grounded datasets, SciClaimHunt and SciClaimHunt_Num, generated from scientific papers and paired with baseline verification methods. It proposes retrieval-based and generation-based baselines (ER, RAG, GCEM, CEM) and assesses claim quality through human annotations, demonstrating cross-dataset generalization to SCIFACT/SCIFACT-OPEN. The findings establish SciClaimHunt and SciClaimHunt_Num as reliable resources for training and benchmarking evidence-based scientific claim verification across domains.

Abstract

Verifying scientific claims presents a significantly greater challenge than verifying political or news-related claims. Unlike the relatively broad audience for political claims, the users of scientific claim verification systems can vary widely, ranging from researchers testing specific hypotheses to everyday users seeking information on a medication. Additionally, the evidence for scientific claims is often highly complex, involving technical terminology and intricate domain-specific concepts that require specialized models for accurate verification. Despite considerable interest from the research community, there is a noticeable lack of large-scale scientific claim verification datasets to benchmark and train effective models. To bridge this gap, we introduce two large-scale datasets, SciClaimHunt and SciClaimHunt_Num, derived from scientific research papers. We propose several baseline models tailored for scientific claim verification to assess the effectiveness of these datasets. Additionally, we evaluate models trained on SciClaimHunt and SciClaimHunt_Num on existing scientific claim verification datasets to gauge the quality and reliability of the proposed datasets. Furthermore, we conduct human evaluations of the claims in the proposed datasets and perform error analysis to assess the effectiveness of the proposed baseline models. Our findings indicate that SciClaimHunt and SciClaimHunt_Num serve as highly reliable resources for training models in scientific claim verification.

Paper Structure

This paper contains 19 sections, 3 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The left sub-figure shows an example of a negative scientific claim involving numerals and cardinal numbers from SciClaimHunt_Num, refuted by evidence extracted from a research paper. The right sub-figure showcases a positive scientific claim from the SciClaimHunt dataset, supported by evidence retrieved from the research paper.
  • Figure 2: Evaluation Criteria For Claims
  • Figure 3: Working diagram of the proposed Retrieval-Augmented Generation-based approach for scientific claim validation. Given a research paper $\mathcal{R}$ and a claim $\mathcal{C}$, the research paper $\mathcal{R}$ is first split into a set of passages $\mathcal{S}_i$. Next, the similarity $\boldsymbol{\alpha}_i$ between the encoded representations $\mathbf{s}_i$ and $\mathbf{c}$ of passage $\mathcal{S}_i$ and the claim $\mathcal{C}$, respectively, is computed. Evidence $\mathcal{E}$ is then obtained by selecting the top-k passages $\mathcal{S}_i$ with the highest similarity scores $\boldsymbol{\alpha}_i$. This evidence $\mathcal{E}$, along with a prompt instruction, is passed to the Gemma model to generate a fact $\mathcal{E}^s$. Given the claim $\mathcal{C}$ and the generated fact $\mathcal{E}^s$, this study adopts two different approaches for scientific claim validation: (i) fine-tuning BERT and RoBERTa, and (ii) instruction tuning with Llama.
  • Figure 4: Claim-Evidence Matching using Multi-Head Attention (CEM) for scientific claim verification. First, the claim $\mathcal{C}_j$ and sentences $\mathcal{S}_i$ are encoded using S-BERT to obtain the encoded representations $\mathbf{c}_j$ for the claim $\mathcal{C}_j$ and $\mathbf{s}_i$ for the $i^{th}$ sentence in the evidence set. Next, we construct an evidence representation matrix $\mathbf{U}^e$ by stacking the encoded representations $\mathbf{s}_i$ of all sentences in the evidence set, where each row of $\mathbf{U}^e$ corresponds to the encoded representation of an individual sentence. Subsequently, we apply multi-head attention between the claim and the evidence set, using the encoded representation of the claim $\mathbf{c}_j$ as the query and the evidence matrix $\mathbf{U}^e$ as both the key and value, to obtain the evidence representation vector $\mathbf{v}$ based on the similarity between the claim and the evidence. Finally, we compute the similarity and difference feature vector between $\mathbf{c}_j$ and $\mathbf{v}$, which is passed through two fully connected layers for classification.
  • Figure 5: Attention heatmaps showing the attention weights between the claim and the individual sentences of the evidence for samples of the Refutes class (negative claims). The heatmaps reveal that the multi-head attention component of the CEM model assigns moderate attention weights to sentences that are merely related and neutral to the claim, and lower weights to sentences with minimal relevance to the claim. A darker colour signifies a higher attention weight assigned to the respective sentence from the evidence set, and a lighter colour a lower one.
  • ...and 3 more figures
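
The evidence-selection step described in the Figure 3 caption (score each passage against the claim, keep the top k) can be illustrated with a minimal sketch. The caption does not say which similarity function is used, so cosine similarity is an assumption here. The function names `cosine` and `select_evidence` and the toy vectors are hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_evidence(claim_vec, passage_vecs, k):
    """Return indices of the k passages most similar to the claim, best first,
    together with the similarity score alpha_i of every passage."""
    scores = [cosine(claim_vec, s) for s in passage_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Toy embeddings (hypothetical values; a real system would use an encoder).
claim = [1.0, 0.0, 1.0]
passages = [
    [1.0, 0.0, 1.0],  # same direction as the claim
    [0.0, 1.0, 0.0],  # orthogonal to the claim
    [1.0, 1.0, 1.0],  # partially aligned with the claim
]
top_k, alphas = select_evidence(claim, passages, k=2)
print(top_k)  # indices of the passages kept as evidence
```

In this toy case the first and third passages are kept as evidence, since the second is orthogonal to the claim and scores zero.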
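
The attention step in the Figure 4 caption (claim as query, evidence matrix as both key and value) can be sketched in plain Python as follows. This is a simplified illustration and not the authors' implementation: the learned projection matrices are omitted, so each head operates on a slice of the raw vectors. The captions do not define the "similarity and difference" features, so the element-wise product and absolute difference used here are assumptions. All function names and toy values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(query, evidence, num_heads):
    """Attend from one query vector over evidence rows (keys = values).

    Assumption: identity projections, so head h only sees slice h of each
    vector. Returns the concatenated head outputs and per-head weights.
    """
    dim = len(query)
    assert dim % num_heads == 0, "dimension must be divisible by num_heads"
    head_dim = dim // num_heads
    output, weights = [], []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        keys = [row[lo:hi] for row in evidence]
        # Scaled dot-product score between the claim and each sentence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
                  for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of the values (here the values equal the keys).
        output.extend(sum(w * k[j] for w, k in zip(attn, keys))
                      for j in range(head_dim))
    return output, weights

def match_features(c, v):
    """Assumed features: element-wise product, then absolute difference."""
    return [a * b for a, b in zip(c, v)] + [abs(a - b) for a, b in zip(c, v)]

c_j = [1.0, 0.0, 0.0, 1.0]      # toy claim representation
U_e = [[1.0, 0.0, 0.0, 1.0],    # sentence aligned with the claim
       [0.0, 1.0, 1.0, 0.0]]    # sentence orthogonal to the claim
v, attn = multi_head_attention(c_j, U_e, num_heads=2)
features = match_features(c_j, v)  # would feed the fully connected layers
```

In both heads the aligned sentence receives the larger attention weight, which mirrors the behaviour the heatmaps in Figure 5 are meant to visualise.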