EX-FEVER: A Dataset for Multi-hop Explainable Fact Verification

Huanhuan Ma, Weizhi Xu, Yifan Wei, Liuji Chen, Liang Wang, Qiang Liu, Shu Wu, Liang Wang

TL;DR

EX-FEVER addresses the gap in explainable, multi-hop fact verification by providing a large-scale dataset with 60k claims and per-hop explanations grounded in Wikipedia. The authors build a three-stage baseline system (document retrieval, abstractive explanation generation, verdict prediction) and conduct extensive experiments, including prompt-based explorations of LLMs. Key findings show that retrieval quality is a bottleneck, with MDR and BERT-based methods offering complementary strengths, and that LLMs can contribute notably as planners and explanation generators. Overall, EX-FEVER offers a valuable benchmark for developing reliable, interpretable multi-hop fact-checking systems and reveals promising directions for integrating LLMs into explainable verification tasks.

Abstract

Fact verification aims to automatically assess the veracity of a claim based on several pieces of evidence. Existing work has focused mainly on improving accuracy and has largely neglected explainability, a critical capability of fact verification systems. Building an explainable fact verification system for complex multi-hop scenarios has been impeded by the absence of a relevant, high-quality dataset: previous datasets are either overly simplified or fail to incorporate essential considerations for explainability. To address this, we present EX-FEVER, a pioneering dataset for multi-hop explainable fact verification. It contains over 60,000 claims involving 2-hop and 3-hop reasoning, each created by summarizing and modifying information from hyperlinked Wikipedia documents. Each instance is accompanied by a veracity label and an explanation that outlines the reasoning path supporting the veracity classification. Additionally, we demonstrate a novel baseline system on EX-FEVER that performs document retrieval, explanation generation, and claim verification, and we use it to validate the significance of our dataset. Furthermore, we highlight the potential of Large Language Models for the fact verification task. We hope our dataset will make a significant contribution by providing ample opportunities to explore the integration of natural language explanations in the domain of fact verification.
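To make the instance structure and the three-stage baseline concrete, the following is a minimal Python sketch, not the authors' implementation. The field names (`claim`, `label`, `explanation`, `num_hops`), the `verify` function, and the three callable stages are hypothetical names introduced for illustration. The sketch also assumes that the verdict stage consumes the claim together with the generated explanation, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List

# Label strings as they appear in the Figure 2 caption.
LABELS = ("SUPPORT", "REFUTE", "NOT ENOUGH INFO")


@dataclass
class Instance:
    """One dataset record; field names are illustrative assumptions."""
    claim: str
    label: str
    explanation: str
    num_hops: int  # the abstract mentions 2-hop and 3-hop claims

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if self.num_hops not in (2, 3):
            raise ValueError("num_hops must be 2 or 3")


@dataclass
class Prediction:
    """The two outputs of the pipeline: a verdict and its explanation."""
    label: str
    explanation: str


def verify(
    claim: str,
    retrieve: Callable[[str], List[str]],
    explain: Callable[[str, List[str]], str],
    predict: Callable[[str, str], str],
) -> Prediction:
    """Chain the three stages: retrieval, explanation, verdict."""
    documents = retrieve(claim)              # stage 1: document retrieval
    explanation = explain(claim, documents)  # stage 2: summary as explanation
    label = predict(claim, explanation)      # stage 3: verdict (assumed inputs)
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return Prediction(label=label, explanation=explanation)
```

Passing the stages in as plain callables keeps the sketch independent of any particular retriever, summarizer, or classifier, so each stage can be swapped or stubbed in isolation.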

Paper Structure (22 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: A sample in the proposed dataset EX-FEVER. The textual explanation in different colors refers to the information in different documents.
  • Figure 2: The baseline system comprises three stages: document retrieval, summary generation as explanations, and verdict prediction. The system produces two main outputs: a veracity label indicating whether the claim is 'SUPPORT'ed, 'REFUTE'd, or there is 'NOT ENOUGH INFO', and a summary that serves as an explanation for the prediction.
  • Figure 3: A sample in the proposed dataset EX-FEVER. The corresponding claim is "A Thousand Suns is an album dealing with human fears such as nuclear warfare, where the theme of the album was subsequently popularized by a traditional pop/jazz American singer and actor".
  • Figure 4: Annotation platform