UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs

Yuho Lee, Taewon Yun, Jason Cai, Hang Su, Hwanjun Song

TL;DR

This paper introduces the UniSumEval benchmark, which extends the range of input contexts and provides fine-grained, multi-dimensional annotations, and uses it to conduct a thorough comparison of SOTA automated summary evaluators.

Abstract

Existing benchmarks for summarization quality evaluation often lack diverse input scenarios, focus on narrowly defined dimensions (e.g., faithfulness), and struggle with subjective and coarse-grained annotation schemes. To address these shortcomings, we create the UniSumEval benchmark, which extends the range of input contexts (e.g., domain, length) and provides fine-grained, multi-dimensional annotations. We use AI assistance in data creation, both to identify potentially hallucinogenic input texts and to help human annotators by reducing the difficulty of fine-grained annotation tasks. With UniSumEval, we benchmark nine recent language models as summarizers, offering insights into their performance across varying input contexts and evaluation dimensions. Furthermore, we conduct a thorough comparison of SOTA automated summary evaluators. Our benchmark data will be available at https://github.com/DISL-Lab/UniSumEval-v1.0.

Paper Structure (48 sections, 10 equations, 12 figures, 21 tables)

Figures (12)

  • Figure 1: UniSumEval contains fine-grained and multi-dimensional human annotations with high IAA on various input domains, types, and lengths. We conduct AI-assisted manual evaluation on 2,025 hallucinogenic text-summary pairs with 2,509 human key-facts.
  • Figure 2: AI-assisted fine-grained manual evaluation.
  • Figure 3: Performance ranking (1-9) of the nine recent summarizers across the five evaluation dimensions. The summarizers are categorized into three distinct groups: non-LLMs, open-source LLMs, and proprietary LLMs.
  • Figure 4: Error distribution by varying input contexts for each summarizer category, showing OutE (out-of-article error), EntE (entity error), RelE (relation error), and SenE (sentence error). Red indicates extrinsic errors, while blue tones denote intrinsic errors.
  • Figure 5: The prompt to generate a summary.
  • ...and 7 more figures