FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation
Farima Fatahi Bayat, Lechen Zhang, Sheza Munir, Lu Wang
TL;DR
This work introduces FactBench, a dynamic benchmark for language model (LM) factuality in real-world interactions, and VERIFY, the pipeline used to build and score it. VERIFY decomposes LM responses into content units, labels each unit against web evidence as supported, unsupported, or undecidable, and combines the labels into a Hallucination Score used to rank prompts. FactBench draws 1,000 prompts across 150 topics from in-the-wild conversations, selects those that are verifiable and useful, and tiers them by model strength to capture evolving challenges. Experiments show that proprietary models outperform open ones, that factuality decreases as prompts become harder, and that VERIFY correlates more strongly with human judgments than existing baselines. The work provides a practical, updatable framework for tracking LM factuality in realistic interactions and highlights how refusals and subjectivity influence evaluation outcomes.
Abstract
The rapid adoption of language models (LMs) across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on Web-retrieved evidence. Importantly, factuality judgments by VERIFY correlate better with human evaluations than those of existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect (unsupported) and inconclusive (undecidable) LM responses. These prompts form FACTBENCH, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely used LMs from the GPT, Gemini, and Llama families on FACTBENCH, yielding the following key findings: (i) Proprietary models exhibit better factuality, with performance decreasing from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows factual precision comparable to or lower than that of Llama3.1-70B-Instruct across all evaluation methods, due to its higher subjectivity, which leads to more content being labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases.
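To make the idea of ranking "hallucination prompts" concrete, here is a minimal Python sketch. It assumes each response has already been decomposed into content units and labeled as supported, unsupported, or undecidable. The function names, the `undecidable_weight` parameter, and the simple rate formula are illustrative assumptions; the abstract does not define the Hallucination Score, so this should not be read as the paper's actual scoring.

```python
from collections import Counter

# The three labels named in the abstract.
LABELS = ("supported", "unsupported", "undecidable")


def _counts(labels):
    """Count labels, rejecting anything outside the three known categories."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return counts


def factual_precision(labels):
    """Fraction of content units labeled 'supported' (0.0 if there are no units)."""
    counts = _counts(labels)
    total = sum(counts.values())
    return counts["supported"] / total if total else 0.0


def hallucination_rate(labels, undecidable_weight=1.0):
    """Share of units that are unsupported or undecidable.

    ASSUMPTION: a plain weighted rate. The weight on undecidable units is a
    hypothetical knob, not a value taken from the paper.
    """
    counts = _counts(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    flagged = counts["unsupported"] + undecidable_weight * counts["undecidable"]
    return flagged / total


def rank_prompts(prompt_labels, undecidable_weight=1.0):
    """Return prompt ids sorted from highest to lowest hallucination rate."""
    return sorted(
        prompt_labels,
        key=lambda p: hallucination_rate(prompt_labels[p], undecidable_weight),
        reverse=True,
    )


# Hypothetical labels for three prompts (made-up data for illustration only).
example = {
    "prompt_a": ["supported", "supported", "unsupported", "undecidable"],
    "prompt_b": ["supported", "supported", "supported", "supported"],
    "prompt_c": ["unsupported", "unsupported", "undecidable", "supported"],
}
ranking = rank_prompts(example)
```

Under these assumptions, prompts at the top of `ranking` are the ones that elicit the largest share of unsupported or undecidable content, which is the selection criterion the abstract describes for building the benchmark.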
