OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization

Yuchen Shen, Xiaojun Wan

TL;DR

Opinion summarization must extract salient aspects and sentiments from noisy reviews, which poses distinct evaluation challenges. This work introduces OpinSummEval, a human-annotated benchmark covering the outputs of 14 models rated on 4 evaluation dimensions, and uses it to analyze how well 24 automatic metrics correlate with human judgments. Neural metrics generally outperform traditional overlap metrics such as ROUGE, and reference-free neural metrics tend to perform well, but even metrics built on strong backbones such as BART and GPT-3/3.5 do not align consistently with human judgments across all dimensions. The paper highlights these gaps in automated evaluation for opinion summarization and advocates QA-based and input-output matching paradigms, along with domain-specific metrics, to advance the field.

Abstract

Opinion summarization sets itself apart from other types of summarization tasks due to its distinctive focus on aspects and sentiments. Although certain automated evaluation methods like ROUGE have gained popularity, we have found them to be unreliable measures for assessing the quality of opinion summaries. In this paper, we present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models. We further explore the correlation between 24 automatic metrics and human ratings across four dimensions. Our findings indicate that metrics based on neural networks generally outperform non-neural ones. However, even metrics built on powerful backbones, such as BART and GPT-3/3.5, do not consistently correlate well across all dimensions, highlighting the need for advancements in automated evaluation methods for opinion summarization. The code and data are publicly available at https://github.com/A-Chicharito-S/OpinSummEval/tree/main.
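
As a rough illustration of what "correlation between automatic metrics and human ratings" can mean in practice, the sketch below computes a rank correlation between one metric's scores and human ratings on a single dimension. It assumes Kendall's tau-b as the correlation statistic, which is a common choice but is not stated in the text above; the function name `kendall_tau_b`, the variable names, and all scores are hypothetical and made up for illustration.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists (tie-aware)."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    # Product of (pairs untied in ys) and (pairs untied in xs).
    denom = sqrt((concordant + discordant + ties_x_only)
                 * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: human ratings on one dimension for five summaries,
# and the scores two made-up metrics assign to the same summaries.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = {
    "metric_a": [0.10, 0.20, 0.30, 0.40, 0.50],  # ranks summaries like the humans
    "metric_b": [0.50, 0.40, 0.30, 0.20, 0.10],  # ranks them in reverse
}

correlations = {
    name: kendall_tau_b(scores, human_ratings)
    for name, scores in metric_scores.items()
}
for name, tau in correlations.items():
    print(f"{name}: tau-b = {tau:+.2f}")
```

Repeating this computation for each metric and each dimension yields a table of correlations, which is one way to see that a metric may track human judgment well on one dimension and poorly on another.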

Paper Structure

This paper contains 29 sections, 2 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: The annotation distribution for each dimension. For each score, we plot the average frequency with which it is assigned across the 4 dimensions, $\pm$ its standard deviation (marked with $\blacktriangledown$ and $\blacktriangle$).
  • Figure 2: The prompt for ChatGPT (Gao et al., 2023).
  • Figure 3: The prompt used in G-Eval (Liu et al., 2023) and the generated CoTs for 4 dimensions, where the differences among the CoTs for different dimensions are the descriptions for step 2.
  • Figure 4: The guidelines for the annotation, with key information shown (we omit the example from the dev set of Yelp due to limited space).