Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation

Mahnaz Koupaee; Jake W. Vincent; Saab Mansour; Igor Shalyminov; Han He; Hwanjun Song; Raphael Shu; Jianfeng He; Yi Nian; Amy Wing-mei Wong; Kyu J. Han; Hang Su

TL;DR

Madisse introduces a multi-agent debate framework for evaluating summary faithfulness that assigns evaluators initial stances to diversify reasoning and surface errors. It adds ambiguity as a new evaluation dimension with a detailed taxonomy and extends datasets with ambiguity annotations. Empirical results on AggreFact (CNN/XSum) and TofuEval (MediaSum/MeetingBank) show Madisse achieving higher balanced accuracy and Krippendorff's alpha than baselines, with initial-stance and simultaneous-debate variants further improving reliability, especially when ambiguous cases are filtered out. The work also demonstrates that debate-generated arguments aid ambiguity detection, suggesting a path toward more robust and interpretable automatic evaluation of summaries.

Abstract

Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle to identify errors in summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their own belief might be) and are forced to come up with a reason to justify the imposed stance, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances, leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing recent faithfulness evaluation datasets, we observe that a summary is not always clearly either faithful or unfaithful to the source document. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate that our approach can help identify ambiguities and performs even better on non-ambiguous summaries.
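The debate loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `assign_initial_stances`, `debate`, and `toy_agent`, the round limit, the majority-vote fallback, and the word-overlap heuristic standing in for an LLM call are all assumptions made for illustration. Agents here answer in parallel within each round, which is one possible reading of a simultaneous debate.

```python
import random
from collections import Counter

STANCES = ("faithful", "unfaithful")


def assign_initial_stances(n_agents, rng):
    """Spread the stances as evenly as possible, then shuffle who gets which."""
    stances = [STANCES[i % len(STANCES)] for i in range(n_agents)]
    rng.shuffle(stances)
    return stances


def debate(document, summary, agents, max_rounds=3, seed=0):
    """Run a multi-round debate; return the final label and the transcript.

    Each agent is a callable (document, summary, stance, transcript) ->
    (new_stance, argument). All agents in a round see the same transcript
    snapshot (assumption: simultaneous rounds).
    """
    rng = random.Random(seed)
    stances = assign_initial_stances(len(agents), rng)
    transcript = []
    for _ in range(max_rounds):
        snapshot = list(transcript)
        replies = [
            agent(document, summary, stance, snapshot)
            for agent, stance in zip(agents, stances)
        ]
        stances = [new_stance for new_stance, _ in replies]
        transcript.extend(argument for _, argument in replies)
        if len(set(stances)) == 1:  # agreement reached, stop early
            break
    # Assumption: fall back to a majority vote if no agreement was reached.
    label, _ = Counter(stances).most_common(1)[0]
    return label, transcript


def toy_agent(document, summary, stance, transcript):
    """Hypothetical stand-in for an LLM call.

    Round one: justify the imposed stance. Later rounds: call the summary
    unfaithful if it contains a word that the document does not.
    """
    if not transcript:
        return stance, f"Assigned '{stance}'; arguing for it."
    doc_words = set(document.lower().split())
    unsupported = [w for w in summary.lower().split() if w not in doc_words]
    new_stance = "unfaithful" if unsupported else "faithful"
    return new_stance, f"After the debate: '{new_stance}'."


if __name__ == "__main__":
    doc = "the council approved the budget on monday"
    summ = "the council rejected the budget"
    label, transcript = debate(doc, summ, [toy_agent] * 4)
    print(label, len(transcript))
```

In practice each agent would wrap a prompted LLM, and the early-stopping and aggregation rules would follow whatever the evaluation protocol specifies; the sketch only shows how an even stance assignment guarantees that both positions are argued in the first round.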
