EVA-Score: Evaluating Abstractive Long-form Summarization on Informativeness through Extraction and Validation

Yuchen Fan, Yazhe Wan, Xin Zhong, Haonan Cheng, Ning Ding, Bowen Zhou

TL;DR

The paper tackles the challenge of evaluating abstractive long-form summarization by focusing on informativeness rather than surface similarity. It introduces EVA-Score, a four-step pipeline (Atomic Fact Generation, Atomic Fact Chain Generation, Document-level Relation Extraction, and LLM-based Validation) that tests information units against references. Across multiple datasets, EVA-Score demonstrates the strongest alignment with human judgments and reveals how longer-context LLMs benefit long-form evaluation, particularly for GPT-4. Diagnostic analysis highlights DocRE and validation stages as bottlenecks, suggesting avenues for refinement and broader applicability of information-centric evaluation beyond summarization.

Abstract

Since the emergence of LLMs, abstractive long-form summarization has received growing attention, as longer input sequences contain more information. Nevertheless, the automatic evaluation of such summaries remains underexplored. Current evaluation metrics for long-form summarization are either similarity-based metrics such as ROUGE and BERTScore, or LLM-based metrics that rely on hand-crafted prompts or pre-defined schemas. We argue that the former rely only on surface-level similarity and fail to consider informativeness, while the latter lack a quantitative analysis of informational richness, are rather subjective and hard to explain, and are not robust, being easily overwhelmed by long contexts. In this paper, we propose a new evaluation metric, EVA-Score, which extracts all information from the given summaries, identifies the information that overlaps with the reference, and calculates an information score. We test EVA-Score on several datasets, and the experimental results show that EVA-Score achieves the highest correlation with human judgments. We also re-evaluate the performance of LLMs on long-form summarization from the information perspective. The results indicate that LLM responses still lag behind human-written answers. Moreover, we provide a detailed analysis of the effectiveness of EVA-Score and outline possible directions for the automatic evaluation of abstractive long-form summarization.
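The extract, overlap, and score flow described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: `extract_facts` is a hypothetical stand-in that splits on sentence boundaries instead of generating atomic facts and relations, `is_supported` is a hypothetical stand-in that uses normalized string matching instead of LLM-based validation, and the precision/recall/F1 aggregation is an assumption, since the abstract only states that an information score is calculated.

```python
import re
from typing import List, NamedTuple


class InfoScore(NamedTuple):
    precision: float
    recall: float
    f1: float


def normalize(fact: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", fact.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def extract_facts(summary: str) -> List[str]:
    """Hypothetical extractor: one 'fact' per sentence.

    A real pipeline would produce atomic facts and relations instead.
    """
    pieces = re.split(r"[.!?]+", summary)
    return [normalize(p) for p in pieces if p.strip()]


def is_supported(fact: str, pool: List[str]) -> bool:
    """Hypothetical validator: exact match after normalization.

    A real pipeline would ask an LLM whether the fact is entailed.
    """
    return fact in pool


def information_score(candidate: str, reference: str) -> InfoScore:
    """Assumed aggregation: F1 over facts shared by candidate and reference."""
    cand = extract_facts(candidate)
    ref = extract_facts(reference)
    matched_cand = sum(is_supported(f, ref) for f in cand)
    matched_ref = sum(is_supported(f, cand) for f in ref)
    precision = matched_cand / len(cand) if cand else 0.0
    recall = matched_ref / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return InfoScore(precision, recall, f1)


if __name__ == "__main__":
    score = information_score(
        "Cats sleep a lot. Dogs bark.",
        "Cats sleep a lot. Birds sing. Fish swim.",
    )
    print(score)
```

In this toy example one of the two candidate facts appears in the reference and one of the three reference facts appears in the candidate, so the score reflects how much information is shared rather than how many words are.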

Paper Structure (19 sections, 3 equations, 4 tables)