Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations

Cheng-Han Chiang, Hung-yi Lee

TL;DR

This paper tackles the challenge of evaluating factuality in long-form LLM generations, showing that even paragraphs composed of verifiable facts can be non-factual due to entity ambiguity. It introduces AmbigBio, a dataset of 500 ambiguous names, and Disambig-FActScore (D-FActScore), a factuality metric that groups atomic facts by potential single entities and verifies each group against a linked entity's knowledge base. Through human and automatic evaluations across multiple open-source models and ChatGPT, the authors find that traditional FActScore overestimates factuality in ambiguous scenarios and that D-FActScore yields more accurate assessments and different model rankings, with ChatGPT performing best at disambiguation. They also provide an automatic evaluation pipeline that closely tracks human judgments, enabling scalable comparison across models and prompting strategies. The findings highlight the need for ambiguity-aware factuality metrics in retrieval-augmented generation and suggest future work beyond Wikipedia-focused verification.

Abstract

Long-form generations from large language models (LLMs) contain a mix of factual and non-factual claims, making factuality difficult to evaluate. Prior work evaluates the factuality of a long paragraph by decomposing it into multiple facts, verifying those facts independently, and aggregating the results. Such methods assume that combining factual claims forms a factual paragraph. This assumption can be violated: we show that strong open-source models like Llama-chat can generate paragraphs in which every fact is verifiable, yet the facts combine into a non-factual paragraph because of entity ambiguity. We further reveal that existing factuality metrics, including FActScore and citation recall, cannot properly evaluate these non-factual paragraphs and overestimate their factuality. To address this, we introduce an enhanced metric, D-FActScore, specifically designed for content with ambiguous entities. We evaluate the D-FActScores of biographies of people generated by retrieval-augmented LLMs. We show that D-FActScore can better assess the factuality of paragraphs with entity ambiguity than FActScore. We also find that four widely used open-source LLMs tend to mix information about distinct entities to form non-factual paragraphs, making their D-FActScore lower than their FActScore by over 10%.
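The gap between the two metrics can be illustrated with a toy sketch. This is not the authors' implementation: the knowledge source, the entity names, the exact-string "supported" check, and the scoring formula (supported facts divided by all facts) are all assumptions made for illustration. It contrasts verifying each fact against any matching entity with verifying each group of facts only against the single entity that group is linked to.

```python
# Hypothetical toy knowledge source: entity -> set of facts it supports.
# Names and facts are invented for illustration only.
KNOWLEDGE = {
    "Alex Doe (swimmer)": {"competed as a swimmer", "won a national title"},
    "Alex Doe (coach)": {"coached a college team", "retired in 1950"},
}


def fact_score(facts, kb):
    """Fraction of facts supported by ANY entity in the knowledge source."""
    if not facts:
        return 0.0
    supported = sum(any(f in known for known in kb.values()) for f in facts)
    return supported / len(facts)


def link_entity(group, kb):
    """Pick the entity that supports the most facts in this group."""
    return max(kb, key=lambda name: len(kb[name] & set(group)))


def d_fact_score(groups, kb):
    """Verify each group only against the one entity it is linked to."""
    total = sum(len(g) for g in groups)
    if total == 0:
        return 0.0
    supported = 0
    for group in groups:
        entity = link_entity(group, kb)
        supported += sum(f in kb[entity] for f in group)
    return supported / total


all_facts = [
    "competed as a swimmer",
    "won a national title",
    "coached a college team",
    "retired in 1950",
]

# A paragraph that presents all four facts as describing one person.
mixed = [all_facts]
# A paragraph that clearly separates the two people.
separated = [all_facts[:2], all_facts[2:]]

print(fact_score(all_facts, KNOWLEDGE))    # every fact matches some entity
print(d_fact_score(mixed, KNOWLEDGE))      # only one entity's facts count
print(d_fact_score(separated, KNOWLEDGE))  # each group matches its entity
```

In this toy setting the mixed paragraph gets a perfect score when facts are checked independently, but only half the facts survive once the whole group must be attributed to one entity, which mirrors the overestimation the abstract describes.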

Paper Structure (42 sections, 3 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Output of Llama-13b-chat when prompted to generate a biography for Dick Hanley. While the paragraph is misleading and non-factual, all the facts in the paragraph can be supported by the Wikipedia page of Dick Hanley (swimmer) ✔ or Dick Hanley (American football) ✔, yielding 100% FActScore. D-FActScore groups atomic facts that appear to refer to the same individual based on the paragraph (Figure 1(c)), finds an entity that best matches that individual from the knowledge source (Figure 1(d)), and only uses the information of that entity to verify the facts in that group ✔.
  • Figure 2: The interface used for annotation.
  • Figure 3: The instructions used for annotation. We do not show the examples in the instructions in this figure.