All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations

Miriam Wanner, Leif Azzopardi, Paul Thomas, Soham Dan, Benjamin Van Durme, Nick Craswell

TL;DR

This paper argues that existing factuality metrics weight all claims equally and therefore fail to detect errors in critical information. It introduces VITAL, a set of importance-weighted metrics that decompose a response into subclaims, rank them by relevance to the query, and label each as vital, okay, or less important. It also introduces VITALERRORS, a 6,733-item adversarial benchmark spanning six datasets that tests sensitivity to missing or incorrect key information. In experiments using GPT-4o, VITAL metrics, especially the response-level ones, are more sensitive to key-information errors than traditional metrics such as FActScore and NuggetRecall. The work provides a framework and benchmark for more reliable factuality evaluation of LLM generations, with implications for safer and more trustworthy deployment of language models.

Abstract

Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect as it receives the same weight as peripheral details, raising the question: how can we reliably detect such differences when there are errors in key information? Current approaches that measure factuality tend to be insensitive to omitted or false key information. To investigate this lack of sensitivity, we construct VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information. Using this dataset, we demonstrate the insensitivities of existing evaluation metrics to key information errors. To address this gap, we introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query. Our analysis demonstrates that VITAL metrics more reliably detect errors in key information than previous methods. Our dataset, metrics, and analysis provide a foundation for more accurate and robust assessment of LLM factuality.

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Three responses to the same query receive similarly high precision and recall scores. Precision and recall metrics are insensitive to errors in key information for long responses. Both erroneous responses still achieve high precision and recall scores.
  • Figure 2: Four responses to the same query with varying error type and severity, demonstrating that not all errors are equal. Standard precision/recall give similar scores, while response-level metrics distinguish missing or wrong vital information from peripheral errors.
  • Figure 3: Cumulative precision over subclaim position for single-answer queries. Wrong responses show low precision early on due to falsified key claims, while normal and missing responses stay consistently high.
  • Figure 4: Cumulative precision over subclaim position for open-ended queries. Wrong responses show low precision early on due to falsified key claims, while normal and missing responses stay consistently high.
  • Figure 5: Linear decay weighted precision and recall. Highlights the differences between normal, missing, and wrong responses, though less strongly than response-level metrics.
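The quantities named in Figures 3 to 5 can be illustrated with a minimal sketch. It assumes subclaims are already ranked by query relevance (most relevant first) and that each carries a boolean "supported" label. The function names and the specific linear weighting (weights n, n-1, ..., 1, normalized to sum to 1) are assumptions made for illustration; the paper's exact formulation may differ.

```python
from typing import List


def cumulative_precision(supported: List[bool]) -> List[float]:
    """Precision over the first k ranked subclaims, for k = 1..n."""
    scores = []
    correct = 0
    for k, ok in enumerate(supported, start=1):
        correct += ok
        scores.append(correct / k)
    return scores


def linear_decay_weights(n: int) -> List[float]:
    """Hypothetical weights n, n-1, ..., 1, normalized to sum to 1."""
    total = n * (n + 1) / 2
    return [(n - i) / total for i in range(n)]


def weighted_precision(supported: List[bool]) -> float:
    """Share of total weight carried by supported subclaims."""
    weights = linear_decay_weights(len(supported))
    return sum(w for w, ok in zip(weights, supported) if ok)


# Ten ranked subclaims, exactly one of them false in each response.
wrong_key_claim = [False] + [True] * 9   # error in the most relevant claim
wrong_peripheral = [True] * 9 + [False]  # error in the least relevant claim

# Unweighted precision is 0.9 for both responses; the weighted variant
# penalizes the key-claim error more heavily than the peripheral one.
print(cumulative_precision(wrong_key_claim)[:3])
print(weighted_precision(wrong_key_claim))
print(weighted_precision(wrong_peripheral))
```

Under these assumptions, an error in the top-ranked subclaim drags cumulative precision down at early positions, which matches the pattern the Figure 3 and Figure 4 captions describe for wrong responses. The same error lowers the weighted score more than an identical error in a low-ranked subclaim.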