FactSim: Fact-Checking for Opinion Summarization

Leandro Anghinoni, Jorge Sanchez

TL;DR

FactSim addresses the challenge of evaluating opinion summaries produced from multiple reviews by focusing on factual consistency rather than surface similarity. It extracts fact-tuples from both source reviews and the summary, encodes them, and combines coverage and consistency into a single score using the harmonic mean of $f_V$ and $f_N$, where $FactSim = 2 f_V f_N /(f_V+f_N)$. The method is automatic and reference-free, leveraging prompt-based extraction with LLMs and a pre-trained embedding space, and it shows high correlation with human judgments, especially on aspect relevance. This approach enables explainable evaluation through fact-tuple analyses and offers a practical path to robust, scalable assessment of GenAI-generated opinion summaries.
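
The combination step can be illustrated with a minimal sketch, assuming $f_V$ and $f_N$ have already been computed and are non-negative. The function name `factsim` and the convention of returning 0 when both components are zero are assumptions made for illustration; they are not taken from the paper.

```python
def factsim(f_v: float, f_n: float) -> float:
    """Harmonic mean of the two component scores: 2 * f_v * f_n / (f_v + f_n)."""
    if f_v < 0 or f_n < 0:
        raise ValueError("component scores are expected to be non-negative")
    total = f_v + f_n
    # Assumed convention (not from the paper): return 0.0 when both
    # components are zero, which avoids a division by zero.
    if total == 0:
        return 0.0
    return 2.0 * f_v * f_n / total


if __name__ == "__main__":
    print(factsim(0.9, 0.5))  # unbalanced components
    print(factsim(0.7, 0.7))  # balanced components
```

Because the harmonic mean is pulled toward the smaller component, a summary that does well on one component but poorly on the other scores lower than the arithmetic mean would suggest: components of 0.9 and 0.5 combine to roughly 0.64 rather than 0.7.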

Abstract

We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which rely on automated metrics to evaluate machine-generated summaries of a collection of opinion pieces, e.g., product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLMs). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method measures the similarity between the claims in a given summary and those in the original reviews, thereby quantifying the coverage and consistency of the generated summary. To do so, we rely on a simple approach to extract factual assessments from texts, which we then compare and aggregate into a suitable score. We demonstrate that the proposed metric assigns higher scores to similar claims, regardless of whether the claim is negated, paraphrased, or expanded, and that the score has a high correlation with human judgment when compared to state-of-the-art metrics.

Paper Structure (19 sections, 3 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Reference summaries for product B000A2FTN6 in the evaluation data proposed by bravzinskas2019unsupervised. In the first row, we show the claim frequency in the original eight reviews for the same product. Notice that none of the three reference summaries covers the colored claims (frequency > 1), which should be the more relevant ones. Also, the wording favors extractive models over abstractive ones, especially under metrics such as ROUGE and BERTScore.
  • Figure 2: Overview of the proposed method. First, we extract fact tuples from both the original reviews and the summary. Then we project the tuples into a semantic embedding space using a pre-trained encoder. Finally, tuple embeddings are compared using cosine similarity and aggregated into a final metric.
  • Figure 3: Pairwise score matrices for the toy examples in Table \ref{tab:toy_examples}. ROUGE-1 and BERTScore attribute a higher score to pairs 1-2 and 3-4, while the correct pairings are 1-3 and 2-4. FactSim gives a consistently higher score to matching pairs than to non-matching ones.
  • Figure 4: Fact similarity for a sample product. The X-axis contains facts from the summary and the Y-axis contains facts from the source reviews. Notice that (price, affordable) and (product, strong) are claims not covered by the LLM. The LLM, however, included facts that are not present in the original reviews, such as (pendant, recommended), which is a conclusion created by gpt-3.5-turbo based on the overall sentiment of the reviews.
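
The comparison step described in Figures 2 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes that tuple embeddings are already available as plain vectors (the toy 3-dimensional vectors below are made up) and that each fact is matched to its most similar counterpart: coverage averages the best matches over source facts, and consistency averages them over summary facts. The function names, the clipping of negative similarities, and this aggregation rule are all hypothetical. The text above specifies only cosine similarity and a harmonic mean of two components, and it does not say how those components map to $f_V$ and $f_N$.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(source: Sequence[Vector],
                      summary: Sequence[Vector]) -> List[List[float]]:
    """Rows index source-review facts, columns index summary facts (as in Figure 4)."""
    # Assumption: negative similarities are clipped to 0 so scores stay in [0, 1].
    return [[max(0.0, cosine(s, t)) for t in summary] for s in source]


def score_summary(source: Sequence[Vector],
                  summary: Sequence[Vector]) -> Tuple[float, float, float]:
    """Return (coverage, consistency, harmonic mean) under the assumed aggregation."""
    if not source or not summary:
        raise ValueError("need at least one fact on each side")
    sim = similarity_matrix(source, summary)
    # Assumed coverage: how well each source fact is matched by some summary fact.
    coverage = sum(max(row) for row in sim) / len(sim)
    # Assumed consistency: how well each summary fact is supported by some source fact.
    columns = list(zip(*sim))
    consistency = sum(max(col) for col in columns) / len(columns)
    total = coverage + consistency
    combined = 0.0 if total == 0.0 else 2.0 * coverage * consistency / total
    return coverage, consistency, combined


if __name__ == "__main__":
    # Toy embeddings: the first source fact is restated in the summary,
    # the second source fact is missed, and the summary adds an unsupported fact.
    source_facts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    summary_facts = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(score_summary(source_facts, summary_facts))
```

In the toy input, one of two source facts is covered and one of two summary facts is supported, so both components and their harmonic mean come out at 0.5. Rows or columns with a low maximum in the matrix identify which facts were missed or added, which is the kind of inspection Figure 4 illustrates.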