UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

Zhaoheng Huang; Zhicheng Dou; Yutao Zhu; Ji-rong Wen

TL;DR

UFO is a unified, flexible pipeline for evaluating the factuality of LLM outputs, built around plug-and-play fact sources and a single verification workflow. It extracts fact units, verifies them against the configured sources, and uses an LLM-based module to judge consistency, producing a factuality score that supports analysis across five text generation tasks. Experimental results show that human-written evidence and reference documents are the most helpful sources for most QA tasks, and that they can substitute for each other in retrieval-augmented QA, while search engine results and LLM knowledge are essential for news fact generation. The work offers a scalable framework with accompanying data and code to advance robust factuality evaluation of LLMs.

Abstract

Large language models (LLMs) may generate text that is inconsistent with human knowledge, leading to factual inaccuracies or \textit{hallucination}. Existing research on evaluating the factuality of LLMs involves extracting fact claims using an LLM and verifying them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources across tasks is under-explored. To address these challenges, we categorize four available fact sources: human-written evidence, reference documents, search engine results, and LLM knowledge, along with five text generation tasks containing six representative datasets. Then, we propose \texttt{UFO}, an LLM-based unified and flexible evaluation framework to verify facts against plug-and-play fact sources. We implement five evaluation scenarios based on this framework. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at \url{https://github.com/WaldenRUC/UFO}.
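
The three stages named above (fact unit extraction, fact source verification, fact consistency discrimination) can be illustrated with a minimal sketch. It is not the UFO implementation: the function `factuality_score`, the question/answer representation of a fact unit, the rule of using the first source that returns evidence, and the substring check standing in for the LLM-based discriminator are all assumptions made for illustration. The score shown is simply the fraction of supported fact units, which may differ from the paper's evaluation criteria.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical types: a fact unit is a (question, answer) pair, and a fact
# source maps a question to a piece of evidence text, or None if it has none.
FactUnit = Tuple[str, str]
FactSource = Callable[[str], Optional[str]]


def factuality_score(
    fact_units: Sequence[FactUnit],
    sources: Sequence[FactSource],
    is_consistent: Callable[[str, str], bool],
) -> float:
    """Return the fraction of fact units supported by the fact sources.

    Assumption: sources are consulted in order and the first one that
    returns evidence decides the verdict for that fact unit.
    """
    if not fact_units:
        return 0.0
    supported = 0
    for question, answer in fact_units:
        for source in sources:
            evidence = source(question)
            if evidence is None:
                continue  # this source has nothing; try the next one
            if is_consistent(answer, evidence):
                supported += 1
            break  # the first source with evidence decides
    return supported / len(fact_units)


# Toy plug-and-play sources backed by dictionaries (illustrative data only).
human_evidence = {"Where is the Eiffel Tower?": "The Eiffel Tower is in Paris."}
reference_docs = {"When was the Eiffel Tower completed?": "It was completed in 1889."}

# Fact units as they might look after extraction from generated text.
fact_units = [
    ("Where is the Eiffel Tower?", "Paris"),
    ("When was the Eiffel Tower completed?", "1890"),
    ("Who designed the Eiffel Tower?", "Gustave Eiffel"),
]


def substring_match(answer: str, evidence: str) -> bool:
    # Stand-in for the LLM-based consistency discriminator.
    return answer.lower() in evidence.lower()


score = factuality_score(
    fact_units, [human_evidence.get, reference_docs.get], substring_match
)
print(f"factuality score: {score:.2f}")
```

Because each source is just a callable, swapping or reordering sources only changes the list passed to `factuality_score`, which is the sense in which the sources are plug-and-play in this sketch.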

Paper Structure (24 sections, 3 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Text Generation and Hallucination
Factuality Evaluation
Methodology
Problem Statement
Fact Sources
UFO Evaluation Framework
Fact Unit Extraction
Fact Source Verification
Fact Consistency Discrimination
Evaluation Criteria
Evaluation Scenarios
Experiments
Datasets
...and 9 more sections

Figures (4)

Figure 1: Our proposed factuality evaluation pipeline, UFO. We integrate four fact sources within various evaluation scenarios to assess the factuality score.
Figure 2: A case of evaluating Vicuna-generated text within the retrieval-augmented QA task where $S = \langle S_{\text{he}}, S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}} \rangle$. Details of the generated text are omitted for clarity. The extracted answers are highlighted.
Figure 3: MR-PT discriminative power curve of evaluation metrics on datasets. The closer the curve is to the bottom-left corner, the better the evaluation metric is. Our proposed model curve is represented by a solid line, while the baseline model curve is depicted using a dashed line.
Figure 4: MR-PT discriminative power curve of evaluation metrics on datasets. The closer the curve is to the bottom-left corner, the better the evaluation metric is. We incorporate reference documents retrieved by Bing Chat as part of the fact source $S_{\text{rd}}$. Non-LLM-based methods are omitted for clarity.
