A Comparative Study of Quality Evaluation Methods for Text Summarization

Huyen Nguyen, Haihua Chen, Lavanya Pobbathi, Junhua Ding

TL;DR

This paper tackles the challenge of evaluating text summarization, particularly when reference summaries are unavailable or costly to obtain. It introduces an LLM-based evaluation framework and compares it with eight automatic metrics and human judgments on patent documents, finding that LLM-based assessments align closely with human evaluation, while traditional metrics often diverge. The study further demonstrates that an iterative, LLM-guided refinement loop can improve summary quality in terms of clarity and coverage, albeit with a trade-off in accuracy. The work suggests that LLM-based evaluation offers a scalable, domain-robust alternative for practical summarization evaluation and a pathway to automatic, continual improvement of summaries in specialized domains.

Abstract

Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics that rely heavily on reference summaries are unsuitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) for evaluating text summarization. We also conduct a comparative study of eight automatic metrics, human evaluation, and our proposed LLM-based method. Seven different types of state-of-the-art (SOTA) summarization models are evaluated. We perform extensive experiments and analysis on datasets of patent documents. Our results show that LLM-based evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency. Based on the empirical comparison, we propose an LLM-powered framework for automatically evaluating and improving text summarization, which we expect to be of broad interest to the community.

Paper Structure (33 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: An Overview of Summarization Evaluation Methods
  • Figure 2: An example of the evaluation data collection form designed on the APPEN platform.
  • Figure 3: Length distributions of the entire dataset (top) and the evaluation sample (bottom).
  • Figure 4: Prompt designed for iteratively improving summarization quality