Less is More for Long Document Summary Evaluation by LLMs

Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka

TL;DR

The work tackles the high cost and Lost-in-the-Middle risk of evaluating long-document summaries with LLMs by introducing Extract-then-Evaluate, which first selects key sentences to form a dense short document x' and then assesses a generated summary ŷ against x' using LLM prompts. Across arXiv, GovReport, PubMed, and SQuALITY, the method improves alignment with human judgments (higher Pearson r and Spearman ρ) while reducing evaluation costs, outperforming full-document evaluation in many settings. The authors compare multiple extraction strategies (LEAD, ROUGE, BERTScore, NLI) and find that ROUGE-based extraction often yields the best results, with an optimal extracted length of around 1000–2000 tokens. Practical guidance includes extracting content that is longer than the summary, using ROUGE-based extraction as a simple and effective choice, and adopting budget-aware configurations (e.g., a 1024-token limit), pointing toward more cost-efficient yet accurate LLM-based evaluation of long documents.
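To make the extraction step concrete, here is a minimal sketch of building x' from a source document under a token budget. It is an illustration under stated assumptions, not the authors' implementation: `unigram_f1` is a simplified stand-in for ROUGE, tokens are counted with a naive regex rather than an LLM tokenizer, and the function names and the per-sentence ranking are hypothetical choices made for this example.

```python
import re
from collections import Counter


def tokens(text):
    """Naive word tokenizer (assumption: a stand-in for a real LLM tokenizer)."""
    return re.findall(r"\w+", text.lower())


def unigram_f1(candidate, reference):
    """Unigram-overlap F1, a simplified proxy for a ROUGE-1 score."""
    c, r = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def extract_key_sentences(sentences, summary, budget=1024):
    """Pick the sentences most similar to the summary until the budget is used.

    Selected sentences are returned in their original document order so that
    the short document x' still reads coherently.
    """
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: unigram_f1(sentences[i], summary),
        reverse=True,
    )
    chosen, used = [], 0
    for i in ranked:
        n = len(tokens(sentences[i]))
        if used + n > budget:
            continue  # skip sentences that would exceed the budget
        chosen.append(i)
        used += n
    return [sentences[i] for i in sorted(chosen)]


doc = [
    "The cat sat on the mat.",
    "Stocks fell sharply today.",
    "A cat was on a mat.",
]
short_doc = extract_key_sentences(doc, "The cat sat on the mat", budget=12)
```

The resulting `short_doc` would then be placed in the evaluation prompt in place of the full document. In practice, the budget would be set in the 1000–2000 token range that the paper reports as working best.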

Abstract

Large Language Models (LLMs) have shown promising performance in summary evaluation tasks, yet they face challenges such as high computational costs and the Lost-in-the-Middle problem where important information in the middle of long documents is often overlooked. To address these issues, this paper introduces a novel approach, Extract-then-Evaluate, which involves extracting key sentences from a long source document and then evaluating the summary by prompting LLMs. The results reveal that the proposed method not only significantly reduces evaluation costs but also exhibits a higher correlation with human evaluations. Furthermore, we provide practical recommendations for optimal document length and sentence extraction methods, contributing to the development of cost-effective yet more accurate methods for LLM-based text generation evaluation.
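The "correlation with human evaluations" mentioned above is typically measured by comparing LLM-assigned scores with human ratings over the same set of summaries. The following is a minimal, dependency-free sketch of Pearson and Spearman correlation. The score lists are made-up illustrative values, not results from the paper, and the function names are chosen for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for five summaries (illustrative values only).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
llm_scores = [2.0, 2.5, 2.0, 4.5, 5.0]
r = pearson(human_scores, llm_scores)
rho = spearman(human_scores, llm_scores)
```

A higher `r` or `rho` would indicate that the LLM-based scores track the human ratings more closely, which is the sense in which the paper compares full-document and extracted-document evaluation.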

Paper Structure (24 sections, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Overview of the long document summary evaluation task by LLMs. Evaluating long document summaries by LLMs is expensive and shows limited alignment with human evaluations. This study demonstrates that extracting important sentences for evaluation in advance not only reduces evaluation costs but also exhibits better alignment with human evaluations.
  • Figure 2: Distribution of sentence positions extracted by different methods. For the scientific domain, ROUGE-based methods tend to extract sentences positioned primarily at the beginning of documents. Conversely, for the general domain, ROUGE-based methods tend to choose sentences from throughout the document. Also, model-based approaches, BERTScore and NLI, tend to extract sentences from diverse locations, regardless of the dataset.
  • Figure 3: Relationship between document length and Pearson correlation shows the highest correlation at 1000-2000 tokens. For the scientific domain, important information is typically concentrated at the beginning (i.e., introduction). In such cases, LEAD performs comparably well. However, in the general domain, important information is more distributed throughout the document, and thus LEAD performs significantly worse than the others.
  • Figure 4: The prompt used for evaluating the consistency of the summary.
  • Figure 5: The prompt used for evaluating the relevance of the summary.
  • ...and 4 more figures
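
The LEAD baseline referenced in the captions of Figures 2 and 3 keeps only the beginning of the document. A minimal sketch follows, assuming whitespace token counting and a hypothetical function name; the paper's exact tokenization and truncation rules are not given here.

```python
def lead_extract(sentences, budget=1024):
    """Keep the leading sentences of a document until the token budget is hit.

    Assumption: tokens are counted by whitespace splitting, and extraction
    stops at the first sentence that does not fit, so the result is always a
    contiguous prefix of the document.
    """
    chosen, used = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > budget:
            break
        chosen.append(sentence)
        used += n
    return chosen


prefix = lead_extract(["a b c", "d e", "f g h i"], budget=5)
```

Because this baseline only ever sees the start of the document, it would be expected to do well where key content sits in the introduction and poorly where it is spread throughout, which matches the pattern the Figure 3 caption describes.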