FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

Alessandro Scirè; Karim Ghonim; Roberto Navigli

TL;DR

This work proposes Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric that sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.

Abstract

Recent advances in text summarization, particularly with the advent of Large Language Models (LLMs), have yielded remarkable performance. However, a notable challenge persists: a substantial number of automatically generated summaries exhibit factual inconsistencies, such as hallucinations. In response, various approaches for evaluating the consistency of summaries have emerged. Yet these newly introduced metrics face several limitations, including a lack of interpretability, a focus on summaries of short documents (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process for long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.
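To make the idea of an NLI-based alignment between claims and source text more concrete, the following is a minimal sketch, not the released FENICE implementation. It assumes the claims have already been extracted from the summary, and it makes three illustrative assumptions that the abstract does not confirm: each claim is scored against a passage as entailment probability minus contradiction probability, each claim is aligned to its best-scoring passage, and the summary score is the mean over claims. The names `toy_nli`, `score_claim`, and `score_summary` are hypothetical, and `toy_nli` is a token-overlap placeholder standing in for a real NLI model.

```python
from typing import Callable, Dict, List, Tuple

# An NLI function maps (premise, hypothesis) to class probabilities.
NliFn = Callable[[str, str], Dict[str, float]]


def toy_nli(premise: str, hypothesis: str) -> Dict[str, float]:
    """Placeholder scorer based on token overlap. NOT a real NLI model."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = set(hypothesis.lower().split())
    if not hypothesis_tokens:
        return {"entailment": 0.0, "contradiction": 0.0, "neutral": 1.0}
    overlap = len(premise_tokens & hypothesis_tokens) / len(hypothesis_tokens)
    return {"entailment": overlap, "contradiction": 0.0, "neutral": 1.0 - overlap}


def score_claim(claim: str, passages: List[str], nli: NliFn) -> Tuple[int, float]:
    """Align one claim to its best-supporting passage.

    Assumption: score = P(entailment) - P(contradiction), maximized over passages.
    Returns (index of best passage, score); (-1, 0.0) if there are no passages.
    """
    if not passages:
        return -1, 0.0
    best_index, best_score = -1, float("-inf")
    for index, passage in enumerate(passages):
        probs = nli(passage, claim)
        score = probs["entailment"] - probs["contradiction"]
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


def score_summary(
    claims: List[str], passages: List[str], nli: NliFn = toy_nli
) -> Tuple[float, List[Tuple[int, float]]]:
    """Assumption: the summary score is the mean of per-claim scores."""
    alignments = [score_claim(claim, passages, nli) for claim in claims]
    if not alignments:
        return 0.0, alignments
    mean_score = sum(score for _, score in alignments) / len(alignments)
    return mean_score, alignments


if __name__ == "__main__":
    source_passages = ["the cat sat on the mat", "dogs bark loudly at night"]
    summary_claims = ["the cat sat", "dogs bark"]
    overall, per_claim = score_summary(summary_claims, source_passages)
    print(overall, per_claim)
```

Returning the per-claim alignments alongside the overall score is what makes a metric of this shape interpretable: each claim can be traced back to the passage that supports it, or flagged when no passage does. In practice, `toy_nli` would be replaced by a trained NLI classifier with the same input and output shape.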

Paper Structure (33 sections, 12 equations, 2 figures, 10 tables)

Introduction
Related Work
NLI-based metrics.
LLM-based metrics.
Claim extraction-based evaluation.
FENICE
Claim extraction
NLI-based claim scoring
Coreference Resolution
Aligning claims across multiple input text granularities
Experiments and Results
Claim extraction
Experimental setup.
Datasets.
Metrics.
...and 18 more sections

Figures (2)

Figure 1: Overview of FENICE: the process begins with the extraction of claims from a given summary (step 1). Extracted claims are then aligned with specific sections of the input document (step 2). Finally, we refine the obtained alignments through a coreference-resolution-based approach (step 3). Best seen in color.
Figure 2: LLM prompt for extracting claims given a summary.
