FineSurE: Fine-grained Summarization Evaluation using LLMs
Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour
TL;DR
This work tackles the challenge of evaluating text summarization with metrics that correlate well with human judgments. It introduces FineSurE, a fine-grained evaluator that uses two LLM-driven tasks (fact checking and keyfact alignment) to assess three dimensions: faithfulness, completeness, and conciseness. The evaluator returns sentence-level and keyfact-level diagnostics in JSON and supports both human-written and automatically extracted keyfacts. It aligns more closely with human judgments than state-of-the-art similarity-, NLI-, QA-, and LLM-based evaluators, especially on completeness and conciseness. Experiments on the FRANK and REALSumm datasets, with multiple LLM backbones and diverse prompting strategies, validate FineSurE's effectiveness and stability, and the code is publicly released. The work highlights the potential of fine-grained, multi-dimensional automated evaluation while acknowledging domain limitations and outlining future directions, including broader benchmarks and additional considerations such as toxicity and bias.
Abstract
Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis: at the summary level we can assign only a single hallucination score, whereas at the sentence level we can count the sentences that contain hallucinations. To remedy these limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). In addition to faithfulness, it employs completeness and conciseness criteria, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. We also conduct extensive benchmarking of FineSurE against SOTA methods, including NLI-, QA-, and LLM-based methods, and show improved performance, especially on the completeness and conciseness dimensions. The code is available at https://github.com/DISL-Lab/FineSurE-ACL24.
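To make the three dimensions concrete, the following is a minimal sketch of how sentence-level fact-checking results and keyfact-level alignment results could be aggregated into percentage-style scores. It is not taken from the released code: the class names, field names, and the exact score definitions (error-free sentences for faithfulness, covered keyfacts for completeness, keyfact-bearing sentences for conciseness) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SentenceCheck:
    """Hypothetical fact-checking result for one summary sentence."""
    sentence: str
    has_error: bool


@dataclass
class KeyfactAlignment:
    """Hypothetical alignment result for one keyfact."""
    keyfact: str
    # Indices of the summary sentences judged to convey this keyfact.
    matched_sentences: List[int] = field(default_factory=list)


def score_summary(
    checks: List[SentenceCheck],
    alignments: List[KeyfactAlignment],
) -> Dict[str, float]:
    """Aggregate fine-grained judgments into three scores in [0, 1].

    Assumed definitions (illustrative, not confirmed by the text above):
      faithfulness = share of summary sentences without a factual error
      completeness = share of keyfacts matched by at least one sentence
      conciseness  = share of summary sentences matched by some keyfact
    """
    if not checks or not alignments:
        raise ValueError("need at least one sentence and one keyfact")

    n_sentences = len(checks)
    error_free = sum(1 for c in checks if not c.has_error)
    covered = sum(1 for a in alignments if a.matched_sentences)
    # Collect distinct, in-range sentence indices that carry a keyfact.
    used = {
        i
        for a in alignments
        for i in a.matched_sentences
        if 0 <= i < n_sentences
    }
    return {
        "faithfulness": error_free / n_sentences,
        "completeness": covered / len(alignments),
        "conciseness": len(used) / n_sentences,
    }
```

Because the inputs are kept per sentence and per keyfact, the same records that produce the scores also show which sentence contains an error and which keyfact is missing, which is the kind of diagnosis a single summary-level Likert score cannot provide.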
