Using Similarity to Evaluate Factual Consistency in Summaries

Yuxuan Ye, Edwin Simpson, Raul Santos Rodriguez

TL;DR

This work addresses the challenge of evaluating factual consistency in abstractive summaries, where traditional metrics like ROUGE fall short. It introduces SBERTScore, a zero-shot, sentence-level factuality metric based on cosine similarity between sentence embeddings of source and summary sentences, with the comparison granularity chosen to avoid truncation. Empirical results show that SBERTScore compares favorably with token-level metrics like BERTScore and is competitive with NLI- and QA-based metrics in a zero-shot setting, while offering superior efficiency. The paper also demonstrates that combining diverse metrics can further enhance factuality detection, and discusses limitations in handling negation and neutral but highly similar text, pointing to future research directions.

Abstract

Cutting-edge abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed. Early summary factuality evaluation metrics are usually based on n-gram overlap and embedding similarity, but are reported to fail to align with human annotations. Therefore, many techniques for detecting factual inconsistencies build pipelines around natural language inference (NLI) or question-answering (QA) models with additional supervised learning steps. In this paper, we revisit similarity-based metrics, showing that this failure stems from the comparison text selection and its granularity. We propose a new zero-shot factuality evaluation metric, Sentence-BERT Score (SBERTScore), which compares sentences between the summary and the source document. It outperforms widely-used word-word metrics including BERTScore and can compete with existing NLI and QA-based factuality metrics on the benchmark without needing any fine-tuning. Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries. We demonstrate how a combination of techniques is more effective in detecting various types of error.
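
The sentence-level comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names are hypothetical, `toy_embed` is a bag-of-words stand-in for a real Sentence-BERT encoder, and the aggregation (best-matching source sentence per summary sentence, then the mean) is an assumption, since the abstract does not specify how sentence scores are combined.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for a Sentence-BERT encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity_score(source_sentences, summary_sentences, embed=toy_embed):
    """Assumed aggregation: best source match per summary sentence, then the mean."""
    if not source_sentences or not summary_sentences:
        return 0.0
    source_vecs = [embed(s) for s in source_sentences]
    best = []
    for sent in summary_sentences:
        vec = embed(sent)
        best.append(max(cosine(vec, sv) for sv in source_vecs))
    return sum(best) / len(best)


# Made-up example text, for illustration only.
source = [
    "The council approved the new budget on Monday.",
    "Construction of the bridge starts in March.",
]
faithful = ["The council approved the new budget on Monday."]
unrelated = ["A local bakery won a national award."]

print(sentence_similarity_score(source, faithful))
print(sentence_similarity_score(source, unrelated))
```

With a real sentence encoder passed in as `embed`, each sentence is embedded separately, so a long source document never has to fit into a single encoder input. A word-overlap stand-in like `toy_embed` cannot tell a sentence from its negation, which mirrors the negation limitation noted in the TL;DR.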

Paper Structure

This paper contains 32 sections, 3 equations, 2 figures, and 11 tables.

Figures (2)

  • Figure 1: Average balanced accuracy of combined metrics on the benchmark. The diagonal shows the balanced accuracy of each original evaluation metric (highlighted in blue). The upper triangular matrix shows the balanced accuracy of joint metrics combined using OR, and the lower triangular matrix shows joint metrics combined using AND. Red blocks highlight balanced accuracies that improve over both original metrics, and green blocks highlight those that are lower than both original metrics. All improvements and declines are statistically significant with $p<0.05$.
  • Figure 2: Cohen's $\kappa$ agreement score among different metrics on the benchmark dataset. Higher agreement is shown in deeper red.
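
The quantities named in the two captions can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code. It assumes each metric has already been thresholded into a binary verdict (1 = consistent, 0 = inconsistent); the function names and all data are hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the consistent (1) and inconsistent (0) classes."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def combine(pred_a, pred_b, mode):
    """Join two binary verdicts with logical OR or AND."""
    if mode == "or":
        return [int(a or b) for a, b in zip(pred_a, pred_b)]
    if mode == "and":
        return [int(a and b) for a, b in zip(pred_a, pred_b)]
    raise ValueError("mode must be 'or' or 'and'")


def cohen_kappa(pred_a, pred_b):
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    n = len(pred_a)
    p_o = sum(1 for a, b in zip(pred_a, pred_b) if a == b) / n
    rate_a = sum(pred_a) / n
    rate_b = sum(pred_b) / n
    # Chance agreement: both say 1, or both say 0, independently.
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1:
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels and verdicts, for illustration only (not results from the paper).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
metric_a = [1, 1, 0, 0, 0, 0, 0, 0]
metric_b = [0, 0, 1, 1, 0, 0, 0, 1]

print(balanced_accuracy(labels, metric_a))
print(balanced_accuracy(labels, metric_b))
print(balanced_accuracy(labels, combine(metric_a, metric_b, "or")))
print(balanced_accuracy(labels, combine(metric_a, metric_b, "and")))
print(cohen_kappa(metric_a, metric_b))
```

On this toy data the two metrics accept different correct summaries, so the OR combination reaches a balanced accuracy of 0.875 against 0.75 and 0.625 for the individual metrics, while AND drops to 0.5. Their kappa is negative, meaning they agree less often than chance would predict, which is the situation where an OR combination has the most to gain.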