Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers

Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu

TL;DR

This study systematically scrutinizes fact verifiers by evaluating 12 pre-trained LLMs plus a specialized verifier across 14 benchmarks, highlighting how dataset annotation ambiguity and mislabeled examples can distort model rankings. It uncovers that frontier LLMs with few-shot prompting are strong yet often overlooked baselines, while small fine-tuned verifiers struggle on complex multi-hop cases unless augmented with synthetic reasoning data. To address data quality issues, the authors construct ClearFacts and GrayFacts, refining benchmarks and enabling targeted analysis of model behavior on ambiguous cases. They further demonstrate that synthetic multi-hop data substantially boosts small verifiers, offering a scalable route to efficient, robust factuality assessment for real-world LLM applications.

Abstract

Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that the approximately 16% of data that is ambiguous or incorrectly labeled substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest a systematic pipeline that uses LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend that future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers.

Paper Structure

This paper contains 49 sections, 11 figures, 21 tables.

Figures (11)

  • Figure 1: Detecting label errors and ambiguous instances in fact verification benchmarks. First, we run four fact verifiers on the remaining instances, and instances correctly classified by the fact verifiers become part of ClearFacts. For instances that are misclassified, three LLM judges evaluate the verifier outputs. If at least one judge flags an output as incorrect, the corresponding instance also becomes part of ClearFacts. The remaining cases are manually annotated: instances identified as ambiguous form GrayFacts, while the rest are added to ClearFacts, with label corrections applied if necessary.
  • Figure 2: We identify four types of label ambiguity in the benchmarks and exclude the ambiguous examples when building ClearFacts, to improve the reliability of model evaluation. This is an example from a fact verification benchmark (AggreFact-CNN; tang2022understanding) with knowledge-level and contextual ambiguity. We primarily classified this example as knowledge-level ambiguity, but later noticed that there could be multiple reasons for the ambiguity. The model identifies Red Devils and United as synonymous, leading it to classify the statement as attributable to the document. The model's rationale is also reasonable and faithful: based on the context, United can refer to Manchester United, and a Google search for https://www.google.com/search?q=red+devils returns Manchester United. A human annotator who is unaware of this equivalence, on the other hand, might not reach the same conclusion.
  • Figure 3: Finding 2: Few-shot prompting significantly improves the performance of LLM-as-fact-verifiers. We report macro F1 scores on ClearFacts using MiniCheck and 12 LLMs under both zero-shot and few-shot settings. For each setup, the same prompt was used consistently across all models.
  • Figure 4: Interface provided to annotators to identify the label errors and ambiguity (first page).
  • Figure 5: Interface provided to annotators to identify the label errors and ambiguity (second page).
  • ...and 6 more figures
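
The routing logic described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the names `Instance`, `route`, and `resolve_manual`, the binary label, and the callable signatures for verifiers and judges are all assumptions made for the example. In particular, the sketch assumes an instance counts as "correctly classified" only when every verifier agrees with the benchmark label, which the caption does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Instance:
    document: str
    claim: str
    label: bool  # benchmark label; True = claim is attributable (assumed binary)


# Hypothetical interfaces: a verifier predicts a label; a judge returns True
# when it flags a verifier's prediction as incorrect.
Verifier = Callable[[Instance], bool]
Judge = Callable[[Instance, bool], bool]


def route(instance: Instance, verifiers: List[Verifier], judges: List[Judge]) -> str:
    """Return 'ClearFacts' or 'manual' following the steps in Figure 1."""
    predictions = [verify(instance) for verify in verifiers]

    # Step 1 (assumption: all verifiers must match the benchmark label).
    if all(pred == instance.label for pred in predictions):
        return "ClearFacts"

    # Step 2: judges review the disagreeing output. With a binary label,
    # every disagreeing prediction equals the negated label.
    disputed_prediction = not instance.label
    if any(judge(instance, disputed_prediction) for judge in judges):
        # At least one judge says the verifier was wrong, so the label stands.
        return "ClearFacts"

    # Step 3: no judge objected to the verifier output; a human decides.
    return "manual"


def resolve_manual(
    instance: Instance, is_ambiguous: bool, corrected_label: Optional[bool] = None
) -> Tuple[str, bool]:
    """Apply a human annotator's decision to an instance routed to 'manual'."""
    if is_ambiguous:
        return "GrayFacts", instance.label
    # Unambiguous: keep in ClearFacts, applying a label correction if given.
    final_label = instance.label if corrected_label is None else corrected_label
    return "ClearFacts", final_label
```

Under these assumptions, only instances where the verifiers disagree with the label and no judge objects reach human annotators, which is what keeps the manual step small enough to run at scale.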