UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Ji-rong Wen

TL;DR

UFO is a unified, flexible pipeline for evaluating the factuality of LLM outputs, built on plug-and-play fact sources and a single verification workflow. It extracts fact units, verifies them against the configured sources, and uses an LLM-based module to judge their consistency and produce a factuality score, which supports cross-task analysis across five tasks. Experimental results show that human-written evidence and reference documents are most helpful for QA tasks, while search engine results and LLM knowledge are crucial for news generation; human-written evidence and reference documents can substitute for each other in retrieval-augmented QA. The work offers a scalable framework, with accompanying data and code, to advance robust factuality evaluation of LLMs.

Abstract

Large language models (LLMs) may generate text that lacks consistency with human knowledge, leading to factual inaccuracies or hallucination. Existing research for evaluating the factuality of LLMs involves extracting fact claims using an LLM and verifying them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources in different tasks is under-explored. To address these challenges, we categorize four available fact sources: human-written evidence, reference documents, search engine results, and LLM knowledge, along with five text generation tasks containing six representative datasets. Then, we propose UFO, an LLM-based unified and flexible evaluation framework to verify facts against plug-and-play fact sources. We implement five evaluation scenarios based on this framework. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at https://github.com/WaldenRUC/UFO.
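
The extract, verify, and discriminate workflow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`FactSource`, `extract_fact_units`, `is_consistent`, `evaluate`) is hypothetical. The LLM-based extraction and discrimination steps are replaced by trivial string stand-ins so the example runs offline. Two further choices are assumptions made for illustration: sources are consulted in their configured order, and the score is the fraction of consistent fact units.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical interface: a fact source maps a fact unit (here a plain claim
# string) to evidence text, or returns None when it has nothing relevant.
FactSource = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    claim: str
    source: Optional[str]  # name of the source that supplied evidence, if any
    consistent: bool


def extract_fact_units(text: str) -> list[str]:
    """Stand-in for LLM-based extraction: one fact unit per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]


def is_consistent(claim: str, evidence: str) -> bool:
    """Stand-in for the LLM-based discriminator: case-insensitive containment."""
    return claim.lower() in evidence.lower()


def evaluate(
    text: str, sources: Sequence[tuple[str, FactSource]]
) -> tuple[float, list[Verdict]]:
    """Check each fact unit against the first source that returns evidence."""
    verdicts = []
    for claim in extract_fact_units(text):
        verdict = Verdict(claim, None, False)  # unverified unless a source answers
        for name, lookup in sources:
            evidence = lookup(claim)
            if evidence is not None:
                verdict = Verdict(claim, name, is_consistent(claim, evidence))
                break
        verdicts.append(verdict)
    # Assumed scoring rule: fraction of fact units judged consistent.
    score = sum(v.consistent for v in verdicts) / len(verdicts) if verdicts else 0.0
    return score, verdicts


# Toy plug-and-play sources; real ones would wrap documents, search, or an LLM.
def reference_docs(claim: str) -> Optional[str]:
    return "Paris is the capital of France" if "Paris" in claim else None


def llm_knowledge(claim: str) -> Optional[str]:
    return "The Moon orbits the Earth"


score, verdicts = evaluate(
    "Paris is the capital of France. The Moon orbits Mars.",
    [("reference_documents", reference_docs), ("llm_knowledge", llm_knowledge)],
)
print(score)  # 0.5
```

Because each source is just a named callable, swapping, reordering, or removing sources only changes the list passed to `evaluate`, which is the sense in which the sources are plug-and-play in this sketch.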

Paper Structure

This paper contains 24 sections, 3 equations, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Our proposed factuality evaluation pipeline UFO. We integrate four fact sources within various evaluation scenarios to assess the factuality score.
  • Figure 2: A case of evaluating Vicuna-generated text within the retrieval-augmented QA task where $S = \langle S_{\text{he}}, S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}} \rangle$. Details of the generated text are omitted for clarity. The extracted answers are highlighted.
  • Figure 3: MR-PT discriminative power curve of evaluation metrics on datasets. The closer the curve is to the bottom-left corner, the better the evaluation metric is. Our proposed model curve is represented by a solid line, while the baseline model curve is depicted using a dashed line.
  • Figure 4: MR-PT discriminative power curve of evaluation metrics on datasets. The closer the curve is to the bottom-left corner, the better the evaluation metric is. We incorporate reference documents retrieved by Bing Chat as part of the fact source $S_{\text{rd}}$. Non-LLM-based methods are omitted for clarity.