MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization
Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour
TL;DR
MDSEval introduces the first meta-evaluation benchmark for multimodal dialogue summarization by curating 198 image-sharing dialogues from public datasets, filtering for multimodal key information, and annotating summaries across eight quality aspects. The authors generate diverse summaries via 12 MLLM-prompts, select five for each dialogue using maximum pairwise distance, and collect judgments from three annotators per summary, with robust inter-annotator agreement. They benchmark three state-of-the-art multimodal evaluators, plus a Checklist-CoT baseline, finding that current MLLMs align poorly with human judgments and exhibit pervasive biases, including score concentration and cross-modal misalignment. The MDSEval dataset and MEKI-based filtering offer a principled foundation for developing more reliable, human-aligned evaluation methods, with implications for advancing robust multimodal dialogue systems. The work also discusses ethical considerations, data licensing, and limitations, outlining future directions to broaden modality coverage and domain applicability.
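The selection step mentioned above (keeping five of the candidate summaries per dialogue by maximum pairwise distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the distance metric or the search procedure, so the Jaccard word-set distance, the exhaustive search over subsets, and the names `jaccard_distance` and `select_diverse` are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Illustrative stand-in metric: 1 - |A & B| / |A | B| over lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(words_a & words_b) / len(union)


def select_diverse(summaries, k=5, distance=jaccard_distance):
    """Return indices of the k summaries with the largest total pairwise distance.

    Hypothetical helper: exhaustive search is affordable here because
    choosing 5 of 12 candidates gives only 792 subsets.
    """
    n = len(summaries)
    if k > n:
        raise ValueError("k cannot exceed the number of candidates")
    # Precompute each pairwise distance once, keyed by (i, j) with i < j.
    dist = {
        (i, j): distance(summaries[i], summaries[j])
        for i, j in combinations(range(n), 2)
    }

    def spread(subset):
        # Sum of distances over all pairs inside the subset.
        return sum(dist[pair] for pair in combinations(subset, 2))

    # On ties, max() keeps the first subset in lexicographic index order.
    return max(combinations(range(n), k), key=spread)
```

With 12 candidate summaries for a dialogue, `select_diverse(candidates, k=5)` would return a tuple of five indices. An embedding-based distance could be passed through the `distance` parameter instead of the word-overlap metric assumed here.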
Abstract
Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art multimodal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
