Claim Extraction for Fact-Checking: Data, Models, and Automated Metrics
Herbert Ullrich, Tomáš Mlynář, Jan Drchal
TL;DR
The paper tackles automated claim extraction for fact-checking by treating it as a one-to-many abstractive generation problem. It introduces FEVERFact, a dataset of 17K atomic claims drawn from 4.4K contextualized Wikipedia sentences, alongside an automated evaluation framework of six metrics: Atomicity, Fluency, Decontextualization, Faithfulness, Focus, and Coverage. Its centerpiece is $F_{fact}$, the harmonic mean of Focus and Coverage. The paper evaluates multiple generation approaches, including QACG baselines, LLM prompting, and T5-based transfer learning, and shows that lightweight, trainable models with diverse-beam decoding can approach human-like claim quality while the evaluation itself remains scalable. The work provides a reusable evaluation toolkit, demonstrates strong metric validation against human judgments, and discusses ethical considerations of deploying generative claim creators. Overall, FEVERFact and its evaluation framework offer a scalable, data-driven path for advancing check-worthy claim generation and its assessment in real-world misinformation detection tasks.
Abstract
In this paper, we explore the problem of Claim Extraction using one-to-many text generation methods, comparing LLMs, small summarization models finetuned for the task, and a previous NER-centric baseline, QACG. As the current publications on Claim Extraction, Fact Extraction, Claim Generation and Check-worthy Claim Detection are quite scattered in their means and terminology, we compile their common objectives and release the FEVERFact dataset, with 17K atomic factual claims extracted from 4K contextualised Wikipedia sentences, adapted from the original FEVER. We compile the known objectives into an evaluation framework of Atomicity, Fluency, Decontextualization, and Faithfulness, each checked for every generated claim separately, and Focus and Coverage, measured against the full set of predicted claims for a single input. For each metric, we implement a scale using a reduction to an already-explored NLP task. We validate our metrics against human grading of generic claims and find that the model ranking on $F_{fact}$, our hardest metric, does not change, and that the evaluation framework approximates human grading very closely in terms of $F_1$ and RMSE.
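The TL;DR describes $F_{fact}$ as the harmonic mean of Focus and Coverage. As a minimal sketch of that aggregation only, the Python below assumes both scores are already available as numbers in [0, 1]. How Focus and Coverage themselves are computed is not specified in this summary, and the function name `f_fact` and the zero-handling convention are illustrative assumptions, not the authors' implementation.

```python
def f_fact(focus: float, coverage: float) -> float:
    """Harmonic mean of Focus and Coverage.

    Assumption: both inputs are scores in [0, 1] for one input sentence
    and its full set of predicted claims. Returning 0.0 when both are
    zero is a convention chosen here to avoid division by zero.
    """
    if not (0.0 <= focus <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("focus and coverage are expected to lie in [0, 1]")
    if focus + coverage == 0.0:
        return 0.0
    return 2.0 * focus * coverage / (focus + coverage)


# Hypothetical scores, for illustration only (not results from the paper).
balanced = f_fact(0.5, 0.5)    # equal inputs give the same value back
lopsided = f_fact(1.0, 0.0)    # a zero in either score gives zero overall
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system cannot score well by maximizing only one of Focus or Coverage, which fits the abstract's description of $F_{fact}$ as the hardest metric.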
