FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

Farima Fatahi Bayat, Lechen Zhang, Sheza Munir, Lu Wang

TL;DR

FactBench introduces a dynamic, real-world factuality benchmark together with the VERIFY pipeline for evaluating how factual language model (LM) responses are. VERIFY decomposes LM responses into content units, labels each unit against web-retrieved evidence as supported, unsupported, or undecidable, and combines the labels into a Hallucination Score used to rank prompts. FactBench draws 1,000 prompts across 150 topics from in-the-wild conversations: prompts are cleaned and clustered, filtered for verifiability and usefulness, and sorted into Hard, Moderate, and Easy tiers by model strength to capture evolving challenges. Experiments show that proprietary models outperform open ones, that factuality declines as prompts become harder, and that VERIFY correlates more strongly with human judgments than existing baselines. The work provides a practical, updatable framework for tracking LM factuality in realistic interactions and highlights how refusals and subjectivity influence evaluation outcomes.

Abstract

The rapid adoption of language models (LMs) across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on Web-retrieved evidence. Importantly, factuality judgment by VERIFY correlates better with human evaluations than existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect (unsupported) and inconclusive (undecidable) LM responses. These prompts form FACTBENCH, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely-used LMs from GPT, Gemini, and Llama families on FACTBENCH, yielding the following key findings: (i) Proprietary models exhibit better factuality, with decreased performance from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual precision than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity that leads to more content labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases.
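
The labeling step described above can be illustrated with a small sketch that aggregates per-unit labels into label shares and a prompt ranking. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `undecidable_weight` parameter, and the scoring rule in `toy_hallucination_score` are hypothetical, since this summary does not give the actual Hallucination Score formula.

```python
from collections import Counter

# The three labels VERIFY assigns to content units, per the abstract.
LABELS = ("supported", "unsupported", "undecidable")


def label_shares(unit_labels):
    """Return the fraction of content units carrying each label."""
    counts = Counter(unit_labels)
    total = len(unit_labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


def toy_hallucination_score(unit_labels, undecidable_weight=0.5):
    """Illustrative score only (an assumption, not the paper's formula).

    Unsupported units count fully; undecidable units count partially,
    controlled by the hypothetical `undecidable_weight`.
    """
    shares = label_shares(unit_labels)
    return shares["unsupported"] + undecidable_weight * shares["undecidable"]


def rank_prompts(labels_by_prompt):
    """Order prompt ids from highest to lowest toy hallucination score."""
    return sorted(
        labels_by_prompt,
        key=lambda prompt: toy_hallucination_score(labels_by_prompt[prompt]),
        reverse=True,
    )


if __name__ == "__main__":
    example = {
        "prompt_a": ["supported", "supported", "supported"],
        "prompt_b": ["supported", "unsupported", "undecidable", "supported"],
    }
    for prompt in rank_prompts(example):
        print(prompt, label_shares(example[prompt]))
```

In this toy setup, prompts whose responses contain more unsupported or undecidable units rise to the top of the ranking, which mirrors how the abstract describes selecting "hallucination prompts".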

Paper Structure

This paper contains 44 sections, 4 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of the two-step process used to evaluate LM responses. Step 1 (left) cleans and clusters prompts and evaluates them for verifiability and usefulness. Step 2 (right) decomposes responses into units, retrieves external evidence, and generates factuality labels (supported, unsupported, undecidable) together with a hallucination score to flag inaccuracies. Together, the two steps collect hallucination prompts and assess their appropriateness.
  • Figure 2: Statistics of different factuality benchmarks. FactBench is the first dynamic and in-the-wild factuality evaluation benchmark with diverse topic coverage.
  • Figure 3: Average percentage of unsupported (UnS) and undecidable (UnD) labels in different LMs (*Instruct version) evaluated by VERIFY. Responses from Llama3.1-405B-Instruct contain the highest proportion of undecidable units across all LMs.
  • Figure 4: Response-level correlation between factuality evaluation methods and human annotations of 40 prompts across 4 LMs (averaged via z-score). F refers to Factual labels, and O refers to Other. VERIFY achieves the highest correlation with human judgments compared to baselines.
  • Figure 5: Refusal rate of different LMs across Hard, Moderate, and Easy tiers of FactBench. Gemini1.5-Pro shows a significantly higher refusal rate than other LMs.
  • ...and 4 more figures