Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization

Mousumi Akter; Santu Karmaker

TL;DR

The paper addresses the shortcomings of ROUGE and the original Sem-nCG by introducing a redundancy-aware, multi-reference Sem-nCG for evaluating extractive summarization. It defines a redundancy penalty $\text{Score}_{red}$ and a final score $\text{Score} = \lambda \cdot \text{Sem-nCG} + (1 - \lambda) \cdot (1 - \text{Score}_{red})$, with $\lambda \in [0,1]$, and constructs ground-truth rankings from multiple sentence embeddings. The authors report stronger alignment with human judgments than ROUGE and BERTScore in both single- and multi-reference settings, and they show how to adapt the metric to multi-reference evaluation via ensemble ground-truth strategies. Overall, the proposed metric offers a more reliable, semantically aware, and redundancy-conscious tool for evaluating extractive summarization, with practical guidance on hyperparameter choice and reference handling.
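
The final score in the TL;DR is a convex combination of Sem-nCG and one minus the redundancy penalty. A minimal Python sketch of that formula follows; the function name and argument names are illustrative assumptions, and the sketch takes both component scores as precomputed values in $[0, 1]$ rather than showing how the paper computes them.

```python
def redundancy_aware_score(sem_ncg: float, score_red: float, lam: float) -> float:
    """Combine Sem-nCG with a redundancy penalty.

    Implements Score = lam * Sem-nCG + (1 - lam) * (1 - Score_red).
    `sem_ncg` and `score_red` are assumed to be precomputed elsewhere;
    this helper is hypothetical and not the authors' code.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * sem_ncg + (1.0 - lam) * (1.0 - score_red)


# Usage: lam = 1 ignores redundancy, lam = 0 ignores Sem-nCG.
score = redundancy_aware_score(sem_ncg=0.8, score_red=0.2, lam=0.5)
```

Setting $\lambda = 1$ recovers the original Sem-nCG, while smaller values give more weight to how non-redundant the summary is.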

Abstract

The ROUGE metric is commonly used to evaluate extractive summarization, but it has been criticized for its lack of semantic awareness and for ignoring the ranking quality of the extractive summarizer. Previous research introduced a gain-based automated metric called Sem-nCG that addresses these issues, as it is both rank-aware and semantics-aware. However, Sem-nCG does not consider the amount of redundancy present in a model summary and does not currently support evaluation with multiple reference summaries. A good model summary balances importance and diversity, but finding a metric that captures both aspects is challenging. In this paper, we propose a redundancy-aware Sem-nCG metric and demonstrate how the revised metric can be used to evaluate model summaries against multiple references, which previous research did not support. Experimental results demonstrate that the revised Sem-nCG metric correlates more strongly with human judgments than the original Sem-nCG metric and the traditional ROUGE and BERTScore metrics in both single- and multiple-reference scenarios.

Paper Structure (17 sections, 3 equations, 2 figures, 6 tables)

Introduction
Redundancy-aware Sem-nCG Metric
Experimental Setup
Results
Redundancy-aware Sem-nCG
Hyperparameter Choice
Redundancy-aware Sem-nCG for Evaluation with Multiple References
Related Work
Conclusion
Limitations
Ethics Statement
Acknowledgements
Appendix
Explanation of Metrics for $\text{Score}_{red}$
Human Evaluation Components
...and 2 more sections

Figures (2)

Figure 1: Kendall's tau ($\tau$) correlation coefficient for $\lambda \in [0, 1]$ when the ROUGE score is used as the redundancy penalty: (a)-(c) for the consistency, (d)-(f) for the relevance, (g)-(i) for the coherence, and (j)-(l) for the fluency dimension, for less overlapping reference (LOR), medium overlapping reference (MOR), and high overlapping reference (HOR).
Figure 2: Kendall's tau ($\tau$) correlation coefficient for $\lambda \in [0, 1]$ when BERTScore is used as the redundancy penalty: (a)-(c) for the consistency, (d)-(f) for the relevance, (g)-(i) for the coherence, and (j)-(l) for the fluency dimension, for less overlapping reference (LOR), medium overlapping reference (MOR), and high overlapping reference (HOR).
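
Both figures plot a rank correlation against $\lambda$. The sketch below shows one way such a curve could be produced, under stated assumptions: it uses a plain Kendall tau-a without tie correction (the paper may use a different variant), and the function names, the 11-point $\lambda$ grid, and the per-summary input lists are hypothetical rather than taken from the paper.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def sweep_lambda(sem_ncg, score_red, human, steps=11):
    """Return {lambda: tau} for an evenly spaced grid of lambda in [0, 1].

    sem_ncg, score_red, and human are equal-length lists with one entry
    per model summary (assumed to be precomputed elsewhere).
    """
    results = {}
    for k in range(steps):
        lam = k / (steps - 1)
        combined = [lam * s + (1 - lam) * (1 - r)
                    for s, r in zip(sem_ncg, score_red)]
        results[lam] = kendall_tau_a(combined, human)
    return results
```

Plotting the returned dictionary for one human-judgment dimension and one reference-overlap group would give a single panel of the kind described in the captions.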
