MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour

TL;DR

MDSEval introduces the first meta-evaluation benchmark for multimodal dialogue summarization. The authors curate 198 image-sharing dialogues from public datasets, filter them for multimodal key information, and annotate summaries across eight quality aspects. They generate diverse summaries using 12 MLLM-prompt combinations, select five per dialogue by maximum pairwise distance, and collect judgments from three annotators per summary, with robust inter-annotator agreement. They benchmark three state-of-the-art multimodal evaluators, plus a Checklist-CoT baseline, and find that current MLLMs align poorly with human judgments and exhibit pervasive biases, including score concentration and cross-modal misalignment. The MDSEval dataset and MEKI-based filtering offer a principled foundation for developing more reliable, human-aligned evaluation methods, with implications for advancing robust multimodal dialogue systems. The work also discusses ethical considerations, data licensing, and limitations, and outlines future directions to broaden modality coverage and domain applicability.
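
The selection of a diverse subset "using maximum pairwise distance" can be illustrated with a small sketch. This is not the authors' implementation: the objective (maximizing the sum of pairwise distances), the exhaustive search, the function name `select_diverse`, and the toy distance matrix are all assumptions made for illustration. The summary above does not state which distance metric or tie-breaking rule the paper uses.

```python
from itertools import combinations

import numpy as np


def select_diverse(dist, k):
    """Return the k indices whose summed pairwise distance is largest.

    `dist` is a symmetric (n, n) matrix of distances between candidate
    summaries. The sum-of-pairwise-distances objective is an assumption.
    Exhaustive search is cheap at the sizes mentioned above
    (choosing 5 of 12 means 792 subsets).
    """
    n = dist.shape[0]
    best_subset, best_score = None, -np.inf
    for subset in combinations(range(n), k):
        # Sum the distance over every unordered pair inside the subset.
        score = sum(dist[i, j] for i, j in combinations(subset, 2))
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy example: six hypothetical candidates placed on a line, pick three.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 10.0])
dist = np.abs(points[:, None] - points[None, :])
chosen, score = select_diverse(dist, 3)
print(chosen, score)
```

In this toy case any optimal subset must contain the two extreme points, which shows how the criterion favors candidates that are far apart. With real summaries, `dist` would be computed from some text-distance measure of the practitioner's choice.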

Abstract

Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art multimodal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.

Paper Structure

This paper contains 24 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of the MDSEval Curation Pipeline. Step 1: Filter high-quality image-sharing dialogues based on predefined criteria including our proposed MEKI. Step 2: Generate multiple summaries per dialogue using various LLMs and prompting strategies. Step 3: Conduct human evaluations of the generated summaries along key dimensions, including multimodal coherence, content coverage, and faithfulness. Step 4: Benchmark SOTA multimodal LLMs and summarization techniques using our MDSEval dataset.
  • Figure 2: A conceptual illustration of Exclusive Key Information (EKI). Given unit-normalized CLIP embeddings for a text ($T$), image ($I$), and pseudo-summary ($S$), we first compute the Exclusive Information (EI) of the image as the component orthogonal to the text: $\operatorname{EI}(I|T) = I - \operatorname{Proj}_T(I)$. The EKI is then calculated by projecting this resulting EI onto the pseudo-summary embedding $S$.
  • Figure 3: Statistics of MDSEval: (a) Upper: Distribution of the number of dialogue turns. Lower: Distribution of the number of dialogue tokens. (b) Faithfulness distribution at the summary level. (c) Inter-annotator agreement (IAA) for all evaluation aspects (see the human annotation section for details about the evaluation aspects). All aspects show strong inter-annotator agreement, with adjacent agreement rates exceeding $74.5\%$.
  • Figure 4: Multimodal evaluation methods exhibit significant score distribution bias, with most evaluations concentrated on a score of $4$. This figure presents the score distributions of evaluation methods compared to human score distributions across four selected evaluation aspects, demonstrating a strong misalignment with human assessments.
  • Figure 5: Different base MLLMs exhibit varying degrees of positional bias. This figure shows the overall preference ratio between options A and B in pairwise comparisons. Compared to the preference ratios from human annotations, GPT-4o-mini demonstrates the least bias, while the other two MLLMs show stronger biases. Gemini-1.5-flash favors option A, whereas Qwen-vl-max favors option B.
  • ...and 2 more figures
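
The projection steps in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code. It assumes that the embeddings are plain vectors, that `Proj` is the usual vector projection, and that the norm of the final projection serves as a scalar score. The function names and the 3-dimensional toy vectors (standing in for CLIP embeddings) are hypothetical. How the two modality directions are combined into MEKI is not stated in the caption and is not modeled here.

```python
import numpy as np


def project(u, v):
    """Vector projection of u onto v."""
    return (np.dot(u, v) / np.dot(v, v)) * v


def exclusive_information(image, text):
    """EI(I|T) = I - Proj_T(I): the part of the image embedding orthogonal to the text."""
    return image - project(image, text)


def exclusive_key_information(image, text, summary):
    """Project EI(I|T) onto the pseudo-summary embedding S."""
    return project(exclusive_information(image, text), summary)


def unit(v):
    """Normalize a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


# Toy 3-d vectors standing in for unit-normalized CLIP embeddings (hypothetical values).
T = unit([1.0, 0.0, 0.0])  # text
I = unit([1.0, 1.0, 0.0])  # image
S = unit([0.0, 1.0, 1.0])  # pseudo-summary

ei = exclusive_information(I, T)
eki = exclusive_key_information(I, T, S)

# Assumption: use the length of the projected vector as a scalar score.
eki_score = float(np.linalg.norm(eki))
print(ei, eki, eki_score)
```

Here `ei` has no component along `T`, so whatever survives the second projection is image content that the text does not already carry and that the pseudo-summary considers relevant.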