UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models
Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Ji-rong Wen
TL;DR
UFO is a unified, flexible pipeline for evaluating the factuality of LLM outputs, built around plug-and-play fact sources and a single verification workflow. It extracts fact units from generated text, verifies them against the configured sources, and uses an LLM-based module to judge whether each unit is consistent with the retrieved evidence, producing a factuality score that can be compared across five text generation tasks. Experimental results show that human-written evidence and reference documents are crucial for most QA tasks and can substitute for each other in retrieval-augmented QA, while search engine results and LLM knowledge are essential for news fact generation. The framework is released together with its dataset and code.
Abstract
Large language models (LLMs) may generate text that is inconsistent with human knowledge, leading to factual inaccuracies or \textit{hallucination}. Existing research on evaluating the factuality of LLMs extracts fact claims using an LLM and verifies them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources across tasks is under-explored. To address these challenges, we categorize four available fact sources: human-written evidence, reference documents, search engine results, and LLM knowledge, along with five text generation tasks covering six representative datasets. We then propose \texttt{UFO}, an LLM-based unified and flexible evaluation framework that verifies facts against plug-and-play fact sources. We implement five evaluation scenarios based on this framework. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at \url{https://github.com/WaldenRUC/UFO}.
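
The workflow described above (extract fact units, verify them against plug-and-play fact sources, score the result) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sentence-based extractor, the keyword-overlap lookup, and the word-containment consistency check are all hypothetical stand-ins for the LLM-based modules, and the rule of trying sources in order until one supports the fact is an assumption made for illustration.

```python
from typing import Callable, List, Optional, Sequence

# A fact source maps a fact unit to evidence text, or None if it has nothing relevant.
# Any callable with this shape can be plugged in (assumed interface, not UFO's API).
FactSource = Callable[[str], Optional[str]]


def extract_fact_units(text: str) -> List[str]:
    """Toy extractor: one fact unit per sentence (the paper uses an LLM for this step)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def make_lookup_source(documents: Sequence[str]) -> FactSource:
    """Hypothetical source over a fixed document list, matched by keyword overlap."""
    def source(fact: str) -> Optional[str]:
        words = set(fact.lower().split())
        for doc in documents:
            if len(words & set(doc.lower().split())) >= 2:
                return doc
        return None
    return source


def is_consistent(fact: str, evidence: str) -> bool:
    """Toy discriminator: every word of the fact occurs in the evidence.
    Stands in for the LLM-based consistency judgment."""
    return set(fact.lower().split()) <= set(evidence.lower().split())


def factuality_score(text: str, sources: Sequence[FactSource]) -> float:
    """Fraction of fact units supported by at least one source, tried in order."""
    facts = extract_fact_units(text)
    if not facts:
        return 0.0
    supported = 0
    for fact in facts:
        for source in sources:
            evidence = source(fact)
            if evidence is not None and is_consistent(fact, evidence):
                supported += 1
                break  # assumption: stop at the first source that supports the fact
    return supported / len(facts)


# Two illustrative sources standing in for human-written evidence and reference documents.
human_evidence = make_lookup_source(["paris is the capital of france"])
reference_docs = make_lookup_source(["the eiffel tower is in paris"])

generated = "Paris is the capital of France. The Eiffel Tower is in Berlin."
print(factuality_score(generated, [human_evidence, reference_docs]))  # 0.5
```

Because each source is just a callable, swapping the order of the list or replacing one source with another changes the evaluation scenario without touching the rest of the pipeline, which is the kind of substitutability the abstract studies.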
