FaaF: Facts as a Function for the evaluation of generated text

Vasileios Katranidis; Gabor Barany

TL;DR

FaaF is a new approach to the fact verification task that leverages the function-calling capabilities of LMs. It significantly enhances the ability of LMs to identify unsupported facts in texts, while also improving efficiency and significantly lowering costs compared to prompt-based methods.

Abstract

The demand for accurate and efficient verification of information in texts generated by large language models (LMs) is at an all-time high, but the problem remains unresolved. Recent efforts have focused on extracting and verifying atomic facts from these texts via prompting LM evaluators. However, we demonstrate that this method of prompting is unreliable when faced with incomplete or inaccurate reference information. We introduce Facts as a Function (FaaF), a new approach to the fact verification task that leverages the function-calling capabilities of LMs. FaaF significantly enhances the ability of LMs to identify unsupported facts in texts, while also improving efficiency and significantly lowering costs compared to prompt-based methods. Additionally, we propose a framework for evaluating factual recall in Retrieval Augmented Generation (RAG) systems, which we employ to compare prompt-based and FaaF methods using various LMs under challenging conditions.

Paper Structure (6 sections, 6 equations, 3 figures, 3 tables)

Introduction
Related Work
Facts as a Function
Assessment of fact verification formulations in the RAG setting
Results
Conclusions & future work

Figures (3)

Figure 1: An overview of FaaF. A constructor dynamically creates a function object from a set of fact statements. Function calling allows LMeval to verify all facts within a single call when provided with an input reference text. FaaF significantly reduces the error rate in identifying unsupported facts compared to prompting, while reducing the number of LMeval calls and output tokens by more than a factor of five.
Figure 2: Overview of the factual recall evaluation for RAG. Given a set of ground truth Answers, facts are extracted via LMf. The hypothesized responses of the RAG (in this instance Ungrounded Answer and Poor Answer) are then tested for recall against the extracted facts.
Figure 3: LMeval call count for a full evaluation of WikiEvalFacts. FaaF formulations result in more than five times fewer LMeval calls, given an average of 5.6 fact statements per QA pair.
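
The idea in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `verify_facts`, the field names `fact_0`, `fact_1`, ..., the verdict labels, and the example facts are all assumptions made for illustration, and the evaluator's reply is mocked rather than obtained from a model. The sketch builds one JSON-Schema-style function definition from a list of fact statements, then computes factual recall from the arguments of a single function call.

```python
import json

# Hypothetical verdict labels; the label set used in the paper may differ.
VERDICTS = ["True", "False", "Not clear from the given passage"]


def build_fact_function(facts):
    """Build one function definition with one enum-typed field per fact."""
    properties = {
        f"fact_{i}": {"type": "string", "enum": VERDICTS, "description": fact}
        for i, fact in enumerate(facts)
    }
    return {
        "name": "verify_facts",  # hypothetical name
        "description": "Judge each fact statement against the reference text.",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }


def factual_recall(arguments_json, n_facts):
    """Fraction of facts marked "True" in the returned function-call arguments."""
    args = json.loads(arguments_json)
    supported = sum(args.get(f"fact_{i}") == "True" for i in range(n_facts))
    return supported / n_facts


# Illustrative fact statements, not taken from WikiEvalFacts.
facts = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower was completed in 1889.",
]
schema = build_fact_function(facts)

# Stand-in for the evaluator's reply; no model is called in this sketch.
mock_reply = json.dumps(
    {"fact_0": "True", "fact_1": "Not clear from the given passage"}
)
print(factual_recall(mock_reply, len(facts)))  # 0.5
```

Because every fact becomes a field of the same function, one evaluator call covers the whole fact set. Under the assumption that a prompt-based setup issues one call per fact, the call count would shrink by roughly the average number of facts per QA pair, which is consistent with the ratio reported in the Figure 3 caption.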
