CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization
Ziwei Gong, Lin Ai, Harshsaiprasad Deshpande, Alexander Johnson, Emmy Phung, Zehui Wu, Ahmad Emami, Julia Hirschberg
TL;DR
The paper tackles the challenge of evaluating long-context meeting summaries, where traditional reference-based and generic LLM evaluators underperform. It introduces CREAM, a reference-free, comparison-based framework that extracts key facts from concatenated summaries, compares them to each candidate, and uses Elo ranking to determine relative quality in terms of completeness and conciseness. Across datasets such as QMSum and IZMS, CREAM yields superior model rankings and strong alignment with human preferences, addressing the middle-curse and self-bias problems observed in prior methods. The work demonstrates practical benefits, including cost efficiency and privacy, and suggests avenues for integration with reinforcement learning and broader evaluator validation. The findings highlight the importance of specialized, comparison-driven evaluation for complex, long-context meeting data.
Abstract
Large Language Models (LLMs) have spurred interest in automatic evaluation methods for summarization, offering a faster, more cost-effective alternative to human evaluation. However, existing methods often fall short when applied to complex tasks such as long-context summarization and dialogue-based meeting summarization. In this paper, we introduce CREAM (Comparison-Based Reference-Free Elo-Ranked Automatic Evaluation for Meeting Summarization), a novel framework that addresses the unique challenges of evaluating meeting summaries. CREAM combines chain-of-thought reasoning with key facts alignment to assess the conciseness and completeness of model-generated summaries without requiring reference summaries. By employing an Elo ranking system, our approach provides a robust mechanism for comparing the quality of different models or prompt configurations.
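To illustrate the Elo ranking step mentioned above, the following is a minimal sketch of the standard Elo update applied to pairwise comparisons between summarization models. It is not the authors' implementation: the K-factor of 32, the initial rating of 1000, and the names `expected_score`, `update_elo`, and `rank_models` are assumptions chosen for illustration, and the way CREAM turns completeness and conciseness scores into a win, loss, or draw is not described in this section.

```python
from typing import Dict, Iterable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B, given their ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings: Dict[str, float], a: str, b: str,
               score_a: float, k: float = 32.0) -> None:
    """Update both ratings in place after one comparison.

    score_a is 1.0 if A's summary wins, 0.0 if it loses, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from the paper.
    """
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))


def rank_models(comparisons: Iterable[Tuple[str, str, float]],
                initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Replay (model_a, model_b, score_a) comparisons and return ratings."""
    ratings: Dict[str, float] = {}
    for a, b, score_a in comparisons:
        ratings.setdefault(a, initial)  # assumed starting rating
        ratings.setdefault(b, initial)
        update_elo(ratings, a, b, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical outcomes; in practice each one would come from an
    # evaluator comparing two candidate summaries of the same meeting.
    outcomes = [
        ("model_a", "model_b", 1.0),
        ("model_b", "model_c", 0.5),
        ("model_a", "model_c", 1.0),
    ]
    final = rank_models(outcomes)
    for name, rating in sorted(final.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant, so the final ordering reflects only relative performance across the comparisons. Elo ratings depend on the order in which comparisons are replayed, so a practical setup might shuffle and average over several passes; that choice is likewise an assumption, not something stated here.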
