Less is More for Long Document Summary Evaluation by LLMs
Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka
TL;DR
The work tackles two problems in evaluating long-document summaries with LLMs: high cost and the Lost-in-the-Middle risk. It introduces Extract-then-Evaluate, which first selects key sentences from the source to form a dense, short document x' and then prompts an LLM to assess a generated summary ŷ against x'. Across arXiv, GovReport, PubMed, and SQuALITY, the method improves alignment with human judgments (higher Pearson r and Spearman ρ) while reducing evaluation costs, outperforming full-document evaluation in many settings. The authors compare multiple extraction strategies (LEAD, ROUGE, BERTScore, NLI) and find that ROUGE-based extraction often yields the best results, with an optimal extracted length of around 1000–2000 tokens. Practical guidance: make the extracted document longer than the summary, use ROUGE-based extraction as a simple and effective default, and choose budget-aware configurations (e.g., a 1024-token limit). Together these point toward more cost-efficient yet accurate LLM-based evaluation of long documents.
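The extraction step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a naive regex sentence splitter, whitespace token counting, and a unigram-recall score as a stand-in for ROUGE, and it restores document order after selection. The function names and the prompt template are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on ., !, ? followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text):
    """Lowercased alphanumeric tokens; a simplification of real tokenization."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(sentence, summary):
    """ROUGE-1-recall-like proxy: share of summary tokens covered by the sentence."""
    summ = Counter(tokens(summary))
    sent = Counter(tokens(sentence))
    total = sum(summ.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, sent[tok]) for tok, count in summ.items())
    return overlap / total


def extract_short_document(document, summary, budget=1024):
    """Build x': take the best-scoring sentences that fit the budget,
    then put them back in their original order (an assumption)."""
    sents = split_sentences(document)
    ranked = sorted(
        range(len(sents)),
        key=lambda i: unigram_recall(sents[i], summary),
        reverse=True,
    )
    chosen, used = [], 0
    for i in ranked:
        n = len(sents[i].split())  # whitespace tokens, not model tokens
        if used + n > budget:
            continue  # skip sentences that would exceed the budget
        chosen.append(i)
        used += n
    return " ".join(sents[i] for i in sorted(chosen))


def build_prompt(short_document, summary):
    """Hypothetical evaluation prompt; the actual wording is not given here."""
    return (
        "Source document:\n" + short_document + "\n\n"
        "Summary:\n" + summary + "\n\n"
        "Rate the summary from 1 to 5."
    )


if __name__ == "__main__":
    doc = (
        "The cat sat on the mat. Stocks fell sharply on Monday. "
        "The dog barked at the cat. Rain is expected tomorrow."
    )
    summ = "The cat sat on the mat and the dog barked."
    print(build_prompt(extract_short_document(doc, summ, budget=12), summ))
```

The resulting prompt would then be sent to an LLM of choice. In practice the budget would be measured in that model's own tokens rather than whitespace-separated words.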
Abstract
Large Language Models (LLMs) have shown promising performance in summary evaluation tasks, yet they face challenges such as high computational costs and the Lost-in-the-Middle problem, where important information in the middle of long documents is often overlooked. To address these issues, this paper introduces a novel approach, Extract-then-Evaluate, which involves extracting key sentences from a long source document and then evaluating the summary by prompting LLMs. The results reveal that the proposed method not only significantly reduces evaluation costs but also exhibits a higher correlation with human evaluations. Furthermore, we provide practical recommendations for optimal document length and sentence extraction methods, contributing to the development of cost-effective yet more accurate methods for LLM-based text generation evaluation.
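Correlation with human evaluations is typically measured by comparing LLM-assigned scores with human ratings for the same set of summaries. The sketch below computes Pearson r and Spearman ρ in plain Python. The function names are illustrative, and the scores in the usage example are made-up placeholders rather than results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    human = [1, 2, 3, 4, 5]  # placeholder human ratings
    llm = [2, 1, 4, 3, 5]    # placeholder LLM scores
    print(pearson(human, llm), spearman(human, llm))
```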
