Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers
Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu
TL;DR
This study systematically scrutinizes fact verifiers by evaluating 12 pre-trained LLMs plus a specialized verifier across 14 benchmarks, highlighting how dataset annotation ambiguity and mislabeled examples can distort model rankings. It uncovers that frontier LLMs with few-shot prompting are strong yet often overlooked baselines, while small fine-tuned verifiers struggle on complex multi-hop cases unless augmented with synthetic reasoning data. To address data quality issues, the authors construct ClearFacts and GrayFacts, refining benchmarks and enabling targeted analysis of model behavior on ambiguous cases. They further demonstrate that synthetic multi-hop data substantially boosts small verifiers, offering a scalable route to efficient, robust factuality assessment for real-world LLM applications.
Abstract
Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest a systematic pipeline that uses LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend that future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers.
