FactSim: Fact-Checking for Opinion Summarization
Leandro Anghinoni, Jorge Sanchez
TL;DR
FactSim evaluates opinion summaries produced from multiple reviews by their factual consistency rather than their surface similarity. It extracts fact-tuples from both the source reviews and the summary, encodes them, and combines coverage and consistency into a single score, the harmonic mean of $f_V$ and $f_N$: $FactSim = 2 f_V f_N / (f_V + f_N)$. The method is automatic and reference-free, relying on prompt-based extraction with LLMs and a pre-trained embedding space, and it shows high correlation with human judgments, especially on aspect relevance. Because the score is built from fact-tuples, the evaluation is explainable, and the approach offers a practical path to robust, scalable assessment of GenAI-generated opinion summaries.
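The combination step in the formula above can be illustrated with a minimal sketch. It only covers the final harmonic mean, not the extraction or encoding of fact-tuples. The function name `factsim`, the assumption that both component scores lie in [0, 1], and the choice to return 0 when both components are 0 are illustrative assumptions, not details confirmed by the paper.

```python
def factsim(f_v: float, f_n: float) -> float:
    """Harmonic mean of two component scores: 2 * f_v * f_n / (f_v + f_n).

    Assumption: both scores lie in [0, 1].
    Assumption: the score is defined as 0.0 when both components are 0,
    which avoids a division by zero.
    """
    if not (0.0 <= f_v <= 1.0 and 0.0 <= f_n <= 1.0):
        raise ValueError("component scores are assumed to lie in [0, 1]")
    if f_v + f_n == 0.0:
        return 0.0
    return 2.0 * f_v * f_n / (f_v + f_n)


if __name__ == "__main__":
    # A summary that is strong on one component but weak on the other.
    print(factsim(0.9, 0.3))
    # A summary that is balanced across both components.
    print(factsim(0.6, 0.6))
```

The harmonic mean penalizes imbalance: components of 0.9 and 0.3 yield roughly 0.45, well below their arithmetic mean of 0.6, so a summary cannot score highly by doing well on only one of the two components.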
Abstract
We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which use automated metrics to evaluate machine-generated summaries of a collection of opinion pieces, e.g., product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLMs). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method measures the similarity between the claims in a given summary and those in the original reviews, capturing both the coverage and the consistency of the generated summary. To do so, we rely on a simple approach to extract factual assessments from texts, which we then compare and aggregate into a suitable score. We demonstrate that the proposed metric attributes higher scores to similar claims, regardless of whether the claim is negated, paraphrased, or expanded, and that the score has a high correlation with human judgment when compared to state-of-the-art metrics.
