Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

TL;DR

Factcheck-Bench introduces a fine-grained, end-to-end framework for annotating and evaluating the factuality of open-domain LLM outputs across eight subtasks: decomposition, decontextualisation, check-worthiness, evidence retrieval, stance, correction, editing, and final revision. The authors construct a document-level factuality benchmark with rich annotations (claims, evidence, stances, and revised content) from 94 ChatGPT/GPT-4 instances, enabling evaluation of intermediate components of fact-checking pipelines. Experimental results show current automatic checkers (e.g., FactScore, FacTool, Perplexity.ai) struggle to identify false claims, with the best F1 around 0.63, underscoring substantial room for improvement. The work also develops an annotation tool and open-sources data and code, discusses limitations (small scale, inter-claim dependencies, evidence quality, biases, and cost), and outlines directions for scaling, improving evidence retrieval, and aligning evaluation metrics with human judgments.

Abstract

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark at three levels of granularity (claim, sentence, and document), aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai struggle to identify false claims, with the best F1 of 0.63 achieved by this annotation solution based on GPT-4. The annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.
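
To make the reported metric concrete, the following is a minimal sketch of how an F1 score for identifying false claims could be computed, treating "false" as the positive class. The function name `f1_for_label`, the label strings, and the toy data are assumptions made for illustration; they are not taken from the benchmark or its released code, and the authors' evaluation script may differ.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str], target: str = "false") -> float:
    """Return the F1 score for one class, treating `target` as the positive label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    # Claims correctly flagged as `target`.
    tp = sum(g == target and p == target for g, p in pairs)
    # Claims flagged as `target` that are not `target` in the gold labels.
    fp = sum(g != target and p == target for g, p in pairs)
    # Gold `target` claims that the checker missed.
    fn = sum(g == target and p != target for g, p in pairs)
    if tp == 0:
        # No true positives: F1 is defined as 0 here to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy claim-level labels, invented for illustration only (hypothetical data).
gold = ["true", "false", "true", "false", "true"]
pred = ["true", "false", "false", "true", "true"]
score = f1_for_label(gold, pred)
```

Scoring the "false" class separately, rather than reporting overall accuracy, matters when false claims are a minority of all claims: a checker that labels everything as true can look accurate while missing every error.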

Paper Structure (75 sections, 13 figures, 11 tables)

Figures (13)

  • Figure 1: Left: Factuality annotation pipeline for LLM outputs. Right: An example workflow.
  • Figure 2: Claim analysis: (1) whether raters can determine the factuality of a claim based on the automatically collected evidence (Yes/No); (2) whether the evidence supports the claim (CP: completely support, PS: partially support, RE: refute, IR: irrelevant); (3) whether the claim needs to be corrected. NA (17) refers to 16 opinion-claims + 1 not-a-claim.
  • Figure 3: FactScore distribution for three component sources and their combination.
  • Figure 4: Sentence analysis: (1) distribution of the number of sentences per response; (2) importance of each sentence for answering the question (most important, intermediately important, or not important); (3) number of sentences across four types, according to whether the sentence contains statements requiring fact-checking. Not_claim refers to a sentence that is not a claim, such as a question.
  • Figure 5: Distribution of the number of component atomic claims per sentence.
  • ...and 8 more figures