A Critical Look at Meta-evaluating Summarisation Evaluation Metrics

Xiang Dai, Sarvnaz Karimi, Biaoyan Fang

TL;DR

It is argued that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics and analyze the generalization ability of existing evaluation metrics.

Abstract

Effective summarisation evaluation metrics enable researchers and practitioners to compare different summarisation systems efficiently. Estimating the effectiveness of an automatic evaluation metric, termed meta-evaluation, is a critically important research question. In this position paper, we review recent meta-evaluation practices for summarisation evaluation metrics and find that (1) evaluation metrics are primarily meta-evaluated on datasets consisting of examples from news summarisation datasets, and (2) there has been a noticeable shift in research focus towards evaluating the faithfulness of generated summaries. We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics and analyze the generalization ability of existing evaluation metrics. In addition, we call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal and the role of summarisation in the workflow.

Paper Structure (40 sections, 5 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The distribution of consistency scores, measured using WeCheck [wu-baidu-2023-acl-wecheck], between source text and reference summary from different datasets. Scores closer to $1$ indicate higher consistency, while a score of $0$ indicates inconsistency. The CNN/DM and XSum datasets [zhang-ladhak-2024-tacl-benchmark-llm-summarization] contain news articles, SAMSum [gliwa-samsung-2019-samsum] messenger-like conversations, arXiv [cohan-dernoncourt-2018-naacl-long-summarization] scholarly articles, and MTSDialog [abacha-microsoft-2023-eacl-clinical-note] doctor-patient encounters.
  • Figure 2: Evaluation results using WeCheck [wu-baidu-2023-acl-wecheck] on two tasks proposed in Multi-LexSum [shen-allenai-2022-neurips-multi-lexsum], where summaries are generated at different target levels of granularity: tiny (25 words, on average) and short (130 words). Prompts used to generate summaries can be found in the appendix (implementation details section).