FineSurE: Fine-grained Summarization Evaluation using LLMs

Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour

TL;DR

This work tackles the challenge of evaluating text summarization with metrics that correlate well with human judgments. It introduces FineSurE, a fine-grained evaluator that uses two LLM-driven tasks—fact checking and keyfact alignment—to assess three dimensions: faithfulness, completeness, and conciseness, yielding sentence- and keyfact-level diagnostics in JSON. The framework supports both human and automatically extracted keyfacts and demonstrates superior alignment with human judgments across multiple benchmarks and backbones, outperforming state-of-the-art similarity-, NLI-, QA-, and LLM-based evaluators, especially on completeness and conciseness. Comprehensive experiments across FRANK and REALSumm datasets, multiple LLM backbones, and diverse prompting strategies validate FineSurE’s effectiveness and stability, with code released for public use. The work highlights the potential of fine-grained, multi-dimensional automated evaluation while acknowledging domain limitations and outlining directions for broader benchmarks and future enhancements such as toxicity and bias considerations.

Abstract

Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at https://github.com/DISL-Lab/FineSurE-ACL24.

Paper Structure (38 sections, 5 equations, 16 figures, 12 tables)

Figures (16)

  • Figure 1: FineSurE framework: the given summary is evaluated by conducting the two tasks of fact checking and keyfact alignment. In this specific example, the faithfulness score is 33%, since only one out of the three summary sentences is factually correct; the completeness score is 75%, since three out of the four keyfacts align with the summary; and the conciseness score is 66%, since two out of the three sentences are related to the keyfacts.
  • Figure 2: Evaluation using FineSurE for eight LLMs in text summarization on CNNDM.
  • Figure 3: Prompt for fact checking: the prompt is tailored for a categorization task, utilizing an instruction format with a structured reasoning step. For every summary sentence, the output is a dictionary that provides the category (one of the error types) along with a concise sentence of reasoning.
  • Figure 4: Prompt for keyfact alignment: the prompt is tailored for keyfact matching, employing a simple instruction format. For every keyfact, the output is a dictionary that provides a binary label of "Yes" (aligned) or "No" (not aligned) and the IDs of all aligned sentences.
  • Figure 5: Prompt for keyfact extraction: the prompt is tailored for extracting keyfacts, utilizing an instruction format with few-shot examples. Given a reference summary, the output is a dictionary that provides a list of keyfacts. We adhere to REALSumm's annotation guideline, which limits the number of keyfacts to 16.
  • ...and 11 more figures
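
The score arithmetic in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the dictionary field names (`category`, `response`, `sentence_ids`), the error-category labels, and the helper function names are all assumptions made for this example, chosen only to mirror the output shapes described in the Figure 3 and Figure 4 captions.

```python
# Hypothetical per-sentence fact-checking output (shape loosely follows the
# Figure 3 caption). Field names and category labels are assumptions.
fact_check = [
    {"sentence_id": 1, "category": "no error"},
    {"sentence_id": 2, "category": "entity error"},
    {"sentence_id": 3, "category": "out-of-context error"},
]

# Hypothetical per-keyfact alignment output (shape loosely follows the
# Figure 4 caption): a Yes/No label plus the IDs of aligned sentences.
alignment = [
    {"keyfact_id": 1, "response": "Yes", "sentence_ids": [1]},
    {"keyfact_id": 2, "response": "Yes", "sentence_ids": [1]},
    {"keyfact_id": 3, "response": "Yes", "sentence_ids": [3]},
    {"keyfact_id": 4, "response": "No", "sentence_ids": []},
]


def faithfulness(fact_check):
    """Fraction of summary sentences labeled as factually correct."""
    correct = sum(1 for item in fact_check if item["category"] == "no error")
    return correct / len(fact_check)


def completeness(alignment):
    """Fraction of keyfacts that align with at least one summary sentence."""
    aligned = sum(1 for item in alignment if item["response"] == "Yes")
    return aligned / len(alignment)


def conciseness(alignment, num_sentences):
    """Fraction of summary sentences that align with at least one keyfact."""
    covered = set()
    for item in alignment:
        if item["response"] == "Yes":
            covered.update(item["sentence_ids"])
    return len(covered) / num_sentences


scores = {
    "faithfulness": faithfulness(fact_check),
    "completeness": completeness(alignment),
    "conciseness": conciseness(alignment, num_sentences=len(fact_check)),
}
```

With this toy input the three values are 1/3, 3/4, and 2/3, which correspond to the 33%, 75%, and 66% quoted in the Figure 1 caption (the caption's 66% appears to be 2/3 truncated rather than rounded). Because the scores are built from sentence-level and keyfact-level labels, the same intermediate data can also be inspected directly to see which sentences or keyfacts caused a low score.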