CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization

Ziwei Gong, Lin Ai, Harshsaiprasad Deshpande, Alexander Johnson, Emmy Phung, Zehui Wu, Ahmad Emami, Julia Hirschberg

TL;DR

The paper tackles the challenge of evaluating long-context meeting summaries, where traditional reference-based and generic LLM evaluators underperform. It introduces CREAM, a reference-free, comparison-based framework that extracts key facts from concatenated summaries, compares them to each candidate, and uses Elo ranking to determine relative quality in terms of completeness and conciseness. Across datasets like QMSum and IZMS, CREAM yields superior model rankings and strong alignment with human preferences, addressing the middle-curse and self-bias observed in prior methods. The work demonstrates practical benefits, including cost efficiency and privacy, and suggests avenues for integration with reinforcement learning and broader evaluator validation. The findings highlight the importance of specialized, comparison-driven evaluation for complex, long-context meeting data.

Abstract

Large Language Models (LLMs) have spurred interest in automatic evaluation methods for summarization, offering a faster, more cost-effective alternative to human evaluation. However, existing methods often fall short when applied to complex tasks such as long-context summarization and dialogue-based meeting summarization. In this paper, we introduce CREAM (Comparison-Based Reference-Free Elo-Ranked Automatic Evaluation for Meeting Summarization), a novel framework that addresses the unique challenges of evaluating meeting summaries. CREAM combines chain-of-thought reasoning with key-fact alignment to assess the conciseness and completeness of model-generated summaries without requiring reference summaries. By employing an Elo ranking system, our approach provides a robust mechanism for comparing the quality of different models or prompt configurations.
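
The Elo step mentioned above can be illustrated with a minimal sketch. It uses the standard Elo update rule; the K-factor of 32, the initial rating of 1000, and the function names (`expected_score`, `update_elo`, `rank_systems`) are illustrative assumptions, not values or identifiers taken from the paper. Each comparison is assumed to yield a score of 1.0 (first summary preferred), 0.0 (second preferred), or 0.5 (tie).

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


def rank_systems(comparisons, initial=1000.0, k=32.0):
    """Rank systems from an ordered list of (system_a, system_b, score_a).

    The initial rating of 1000 is an assumption made for illustration.
    Returns (system, rating) pairs sorted from best to worst.
    """
    ratings = {}
    for a, b, score_a in comparisons:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical outcomes of pairwise summary comparisons.
    outcomes = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0)]
    for name, rating in rank_systems(outcomes):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on the ratings at the time of the comparison, the final ratings depend on the order in which comparisons are processed; averaging over several shuffled orders is one common way to reduce that sensitivity.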

Paper Structure (33 sections, 5 equations, 2 figures, 5 tables)


Figures (2)

  • Figure 1: Illustrations of the current evaluation framework (left) and the CREAM framework (right). On the left, traditional methods independently summarize and score each meeting transcript against reference texts. On the right, CREAM distills candidate summary pairs into key facts, conducts pairwise comparisons of the summary pairs, and then uses an Elo rating system to rank summaries by their relative quality.
  • Figure 2: Baseline completeness scores on the REALSum dataset. The $x$-axis shows the summarization models, and the $y$-axis shows the completeness score under four settings: (i) human annotations for both key facts and alignment; (ii) human-annotated key facts with machine alignment; (iii) machine-annotated key facts extracted from the summary, with machine alignment; (iv) machine-annotated key facts extracted from the transcript, with machine alignment.