Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?
Antonín Jarolím, Martin Fajčík, Lucia Makaiová
TL;DR
The paper tackles the extraction of fine-grained, verifiable evidence spans from source texts to support or refute claims in online discussions about news articles in Czech and Slovak. It introduces a manually annotated dataset with two independent span annotations per sample and evaluates a wide range of LLMs and baselines on the span extraction task. The prompt requests the smallest verbatim spans as a JSON list, and outputs are assessed with token-level F1 and Hungarian matching. Findings show that model size yields diminishing returns and that many large models still generate invalid spans, while smaller models can be competitive. deepseek-r1:32b and qwen3:14b achieve the strongest alignment with human annotations (≈55–56 token-F1), occasionally surpassing inter-annotator agreement on a given annotation scheme. The work provides a new resource for alignment studies in Czech/Slovak and highlights the need for constrained decoding to improve reliability, identifying 14B qwen3, 32B deepseek-r1, and 20B gpt-oss as favorable trade-offs between size and alignment.
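To make the scoring idea concrete, the following is a minimal sketch of token-level F1 combined with an optimal one-to-one matching between predicted and annotated spans. It is not the authors' evaluation code. Several details are assumptions: whitespace tokenization, averaging over the larger of the two span sets, and scoring two empty sets as 1.0. The function names `token_f1` and `matched_f1` are hypothetical. To stay dependency-free, the sketch finds the optimal assignment by exhaustive search over permutations rather than with the Hungarian algorithm; both yield the same optimum, but exhaustive search is only practical for a handful of spans.

```python
from collections import Counter
from itertools import permutations


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two spans (whitespace tokens: an assumption)."""
    p, g = pred.split(), gold.split()
    if not p or not g:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def matched_f1(preds: list[str], golds: list[str]) -> float:
    """Best one-to-one pairing of spans, averaged over the larger set.

    Exhaustive search stands in for the Hungarian algorithm here; both
    return an optimal assignment, but this one scales only to a few spans.
    Unmatched spans contribute zero (an assumption about aggregation).
    """
    if not preds and not golds:
        return 1.0  # assumption: nothing predicted, nothing annotated
    if not preds or not golds:
        return 0.0
    small, large = sorted((preds, golds), key=len)
    best = max(
        sum(token_f1(s, l) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)
```

For larger span sets, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` applied to the negated pairwise F1 matrix would be the natural replacement for the permutation loop.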
Abstract
Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task: fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence produced by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
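The notion of an invalid output can be illustrated with a small check. This is a sketch under assumptions, not the paper's procedure: it treats a model response as valid only if it parses as a JSON list of strings and every span occurs as an exact, non-empty substring of the source text. The function name `invalid_spans` is hypothetical, and the paper's actual validity criterion may differ (for example, in how whitespace or casing is handled).

```python
import json
from typing import Optional


def invalid_spans(model_output: str, source_text: str) -> Optional[list[str]]:
    """Return the spans that are not copied verbatim from the source.

    Returns None when the output is not a JSON list of strings at all.
    An empty list means every span was found in the source text.
    """
    try:
        spans = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(spans, list) or not all(isinstance(s, str) for s in spans):
        return None
    # Exact substring match (an assumption); empty spans are flagged too.
    return [s for s in spans if not s or s not in source_text]
```

A check of this kind only detects non-verbatim spans after generation; constrained decoding would instead prevent them from being produced.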
