
Core: Robust Factual Precision with Informative Sub-Claim Identification

Zhengping Jiang, Jingyu Zhang, Nathaniel Weir, Seth Ebner, Miriam Wanner, Kate Sanders, Daniel Khashabi, Anqi Liu, Benjamin Van Durme

TL;DR

Augmenting popular factual precision metrics with Core makes them substantially more robust across a wide range of knowledge domains. An evaluation framework supporting easy and modular use of Core and various decomposition strategies is also released.

Abstract

Hallucinations pose a challenge to the application of large language models (LLMs), thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore, can be manipulated by adding obvious or repetitive subclaims to artificially inflate scores. This observation motivates our new customizable plug-and-play subclaim selection component called Core, which filters individual subclaims according to their uniqueness and informativeness. We show that many popular factual precision metrics augmented by Core are substantially more robust on a wide range of knowledge domains. We release an evaluation framework supporting easy and modular use of Core and various decomposition strategies, and we recommend that the community adopt it. We also release an expansion of the FActScore biography dataset to facilitate further studies of decomposition-based factual precision evaluation.
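
The score inflation described in the abstract can be illustrated with a small worked example. This is only a toy sketch, not the Core algorithm: the function names, the example subclaims, and the hand-written set of "trivial" claims are all hypothetical, and exact string matching stands in for whatever uniqueness and informativeness criteria Core actually applies.

```python
def factual_precision(subclaims):
    """Fraction of subclaims judged supported; each item is (text, supported)."""
    if not subclaims:
        return 0.0
    return sum(1 for _, supported in subclaims if supported) / len(subclaims)


def select_unique_informative(subclaims, trivial):
    """Hypothetical stand-in for a subclaim selection step.

    Drops exact repeats (after whitespace and case normalization) and any
    claim listed in `trivial`. A real selector would need a notion of
    entailment and informativeness rather than string equality.
    """
    seen = set()
    kept = []
    for text, supported in subclaims:
        key = " ".join(text.lower().split())
        if key in seen or key in trivial:
            continue
        seen.add(key)
        kept.append((text, supported))
    return kept


# Two substantive subclaims, one of which is unsupported.
clean = [
    ("She was born in 1950.", True),
    ("She won the award in 1990.", False),
]
# Padding with a trivially true claim, repeated six times.
padded = clean + [("She is a person.", True)] * 6
trivial = {"she is a person."}

raw_clean = factual_precision(clean)    # 1 of 2 supported
raw_padded = factual_precision(padded)  # 7 of 8 supported
adjusted = factual_precision(select_unique_informative(padded, trivial))

print(raw_clean, raw_padded, adjusted)
```

In this toy setting the padded output scores higher than the clean one under plain precision, while the filtered score falls back to that of the clean output, which is the qualitative behavior the abstract attributes to Core-adjusted metrics.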

Paper Structure (30 sections, 7 equations, 10 figures, 7 tables, 1 algorithm)

Figures (10)

  • Figure 1: Factual precision (FP) of summaries generated from the biography prompt of Min et al. (2023) (top) and from a prompt that encourages repetitive generation (bottom): LLMs like chat-gpt-3.5-turbo can easily hack factual precision metrics like FActScore by paraphrasing trivially true claims.
  • Figure 2: Core interposes between the decomposition step and the verification step, selecting the most representative set of subclaims that can be identified from the generation to safeguard against trivial or repetitive inputs.
  • Figure 3: Result of deduplication with uniform weighting. Shaded nodes form one viable selection by the algorithm. Top: uniform weighting selects the most fine-grained decomposition. Bottom: uniform weighting may select any subclaim within a monotonous entailment chain.
  • Figure 4: Corrupted summaries can achieve higher FActScore than clean summaries simply by mixing in more uninformative (top) or more repetitive (bottom) sentences (x-axis). However, they do not achieve higher Core-adjusted FActScore.
  • Figure 5: Corrupted summaries can achieve higher FActScore than clean summaries simply by mixing in more uninformative sentences (x-axis) on the entertainment domain. However, they do not achieve higher Core-adjusted FActScore.
  • ...and 5 more figures