Using Similarity to Evaluate Factual Consistency in Summaries
Yuxuan Ye, Edwin Simpson, Raul Santos Rodriguez
TL;DR
This work addresses the challenge of evaluating factual consistency in abstractive summaries, where traditional metrics like ROUGE fall short. It introduces SBERTScore, a zero-shot, sentence-level factuality metric based on cosine similarity between sentence embeddings of source and summary sentences, with the comparison granularity chosen to avoid truncation. Empirical results show that SBERTScore compares favorably with token-level metrics such as BERTScore and is competitive with NLI- and QA-based metrics in a zero-shot setting, while being more efficient. The paper also demonstrates that combining diverse metrics can further improve factuality detection, and discusses limitations in handling negation and text that is highly similar to the source yet neutral with respect to it, pointing to future research directions.
Abstract
Cutting-edge abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed. Early summary factuality evaluation metrics are usually based on n-gram overlap and embedding similarity, but are reported to fail to align with human annotations. Therefore, many techniques for detecting factual inconsistencies build pipelines around natural language inference (NLI) or question-answering (QA) models with additional supervised learning steps. In this paper, we revisit similarity-based metrics, showing that this failure stems from the comparison text selection and its granularity. We propose a new zero-shot factuality evaluation metric, Sentence-BERT Score (SBERTScore), which compares sentences between the summary and the source document. It outperforms widely-used word-word metrics including BERTScore and can compete with existing NLI and QA-based factuality metrics on the benchmark without needing any fine-tuning. Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries. We demonstrate how a combination of techniques is more effective in detecting various types of error.
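
The sentence-level comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation. The names split_sentences, toy_embed and sentence_similarity_score are hypothetical, the bag-of-words toy_embed is only a stand-in for a Sentence-BERT encoder so that the example runs without downloading a model, and the aggregation (best-matching source sentence per summary sentence, then the mean) is an assumption rather than something the abstract specifies.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]  # sparse vector: token -> weight


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(sentence: str) -> Vector:
    # Stand-in for a sentence encoder (assumption): lowercase bag-of-words counts.
    # A real setup would return a dense Sentence-BERT embedding instead.
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(u: Vector, v: Vector) -> float:
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity_score(
    source: str,
    summary: str,
    embed: Callable[[str], Vector] = toy_embed,
) -> float:
    # Embed every sentence separately, so no text is cut off by an
    # encoder's input length limit (the granularity point made above).
    src = [embed(s) for s in split_sentences(source)]
    summ = [embed(s) for s in split_sentences(summary)]
    if not src or not summ:
        return 0.0
    # Assumed aggregation: for each summary sentence keep its best-matching
    # source sentence, then average over the summary sentences.
    best = [max(cosine(s, d) for d in src) for s in summ]
    return sum(best) / len(best)


if __name__ == "__main__":
    source = "The cat sat on the mat. It rained all day."
    print(sentence_similarity_score(source, "The cat sat on the mat."))
    print(sentence_similarity_score(source, "Stocks fell sharply."))
```

Because the encoder is passed in as the embed argument, the toy embedding can be replaced by any function that maps a sentence to a vector, provided cosine is adapted to that vector type. No training or fine-tuning step appears in the sketch, which is what makes a metric of this kind zero-shot.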
