Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations
Cheng-Han Chiang, Hung-yi Lee
TL;DR
This paper tackles the challenge of evaluating factuality in long-form LLM generations, showing that even paragraphs composed of verifiable facts can be non-factual due to entity ambiguity. It introduces AmbigBio, a dataset of 500 ambiguous names, and Disambig-FActScore (D-FActScore), a factuality metric that groups atomic facts by potential single entities and verifies each group against a linked entity's knowledge base. Through human and automatic evaluations across multiple open-source models and ChatGPT, the authors find that traditional FActScore overestimates factuality in ambiguous scenarios and that D-FActScore yields more accurate assessments and different model rankings, with ChatGPT performing best at disambiguation. They also provide an automatic evaluation pipeline that closely tracks human judgments, enabling scalable comparison across models and prompting strategies. The findings highlight the need for ambiguity-aware factuality metrics in retrieval-augmented generation and suggest future work beyond Wikipedia-focused verification.
Abstract
Long-form generations from large language models (LLMs) contain a mix of factual and non-factual claims, making evaluating factuality difficult. Prior works evaluate the factuality of a long paragraph by decomposing it into multiple facts, verifying those facts independently, and aggregating the results. Such methods assume that combining factual claims forms a factual paragraph. This assumption can be violated: we show that strong open-source models such as Llama-chat can generate paragraphs in which each fact is individually verifiable, yet the facts are combined into a non-factual paragraph due to entity ambiguity. We further reveal that existing factuality metrics, including FActScore and citation recall, cannot properly evaluate these non-factual paragraphs and overestimate their factuality. To address this, we introduce an enhanced metric, D-FActScore, specifically designed for content with ambiguous entities. We evaluate the D-FActScores of biographies of people generated by retrieval-augmented LLMs. We show that D-FActScore can better assess the factuality of paragraphs with entity ambiguity than FActScore. We also find that four widely used open-source LLMs tend to mix information about distinct entities to form non-factual paragraphs, making their D-FActScore lower than their FActScore by over 10%.
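To make the gap between the two styles of scoring concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that atomic facts are plain strings, that verification is exact string lookup in a small hand-written knowledge base, and that a group of facts is linked to whichever entity supports the most facts in that group. The names `KB`, `naive_score`, `link_entity`, and `grouped_score`, and all facts in the example, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy knowledge bases for two different people who share a name.
KB: Dict[str, Set[str]] = {
    "John Smith (explorer)": {"born in 1580", "helped found Jamestown"},
    "John Smith (economist)": {"born in 1930", "taught at a university"},
}


def naive_score(facts: List[str], kb: Dict[str, Set[str]]) -> float:
    """Verify each fact independently: it counts as supported if ANY
    entity with the ambiguous name supports it."""
    all_facts = set().union(*kb.values())
    return sum(f in all_facts for f in facts) / len(facts)


def link_entity(group: List[str], kb: Dict[str, Set[str]]) -> str:
    """Assumed linking rule: pick the entity whose knowledge base
    supports the most facts in the group."""
    return max(kb, key=lambda entity: sum(f in kb[entity] for f in group))


def grouped_score(groups: List[List[str]], kb: Dict[str, Set[str]]) -> float:
    """Verify each group of facts only against its single linked entity.
    Each group holds the facts the paragraph attributes to one person."""
    total = sum(len(group) for group in groups)
    supported = 0
    for group in groups:
        entity = link_entity(group, kb)
        supported += sum(f in kb[entity] for f in group)
    return supported / total


# A paragraph that presents all three facts as describing one person.
facts = ["born in 1580", "helped found Jamestown", "taught at a university"]
mixed_paragraph = [facts]

# A paragraph that separates the two people before stating the same facts.
disambiguated_paragraph = [
    ["born in 1580", "helped found Jamestown"],
    ["taught at a university"],
]

print(naive_score(facts, KB))
print(grouped_score(mixed_paragraph, KB))
print(grouped_score(disambiguated_paragraph, KB))
```

In this toy setting every fact is verifiable on its own, so the independent score is 1.0, while the grouped score for the mixed paragraph drops to 2/3 because the third fact belongs to a different person. The paragraph that keeps the two people apart scores 1.0 under both.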
